Polish Y

This web page was not updated between Oct 2012 and Aug 2013. Some of the topics were updated in early August 2013.

Update 15 Feb 2012 based on data from the Polish Project, taken to be representative of Historical Poland. The % column shows the percentage for each clade in the Polish Project.

In the Short Code Name column, click on the link to jump down to a discussion of that clade.

Haplogroup	SNP	Proposed Sub-Clade	Short Code Name	%	Concentration
I1	M253	I1 P type	I-P	0.6	Poland
N1c1d1	L551		N-G	0.6	Lithuania
N1c1d2	L591		N-M	0.6	Lithuania
R1a1a1g1*	M458+ L260-	N type	N	8.8	Eastern Europe
R1a1a1g1*	M458+ L260-	Np cluster	Np	1.1	Poland
R1a1a1g1b	L260		P	8.9	Poland
R1a1a1g2	Z280	B type	B	2.7	Eastern Europe
R1a1a1g2	Z280	I type	I	2.8	Eastern Europe
R1a1a1g2	Z280	K type	K	3.6	Eastern Europe
R1a1a1g2a	P278.2	H type	H	0.9	Eastern Europe
R1a1a1g2b	L365	G type	G	1.8	Pomerania
R1a1a1g2d	Z92	E type	E	2.1	Eastern Europe
R1a1a1h1	L342.2	A type	A	1.7	Ashkenazi
R1b1a2a	L23	R1b EE type	R1b-EE	1.3	Eastern Europe
R1b1a2a1a1a5b2a	L47	R1b A type	R1b-A	0.9	Ashkenazi
R1b1a2a1a1a5b2a	L47	R1b P type	R1b-P	0.6	Poland

The table above is a brief summary, with some interesting results, and some recent results. For more results, please click on the following links:

The Polish Project has assignments of men (samples) to haplogroups and to proposed subdivision clades based on their Y-DNA data. Lawrence Mayka, administrator of the Polish Project, provides data for this web site of mine. I help Mayka with statistical methods for assignment of samples. This web document is for explanation, details, and update news.

The Results Table has a summary of assignment statistics. Some assignment categories have a link to more detailed discussion. If you know your assignment you can click on the link in the right column of the Results Table to read more about your assignment category.

Many of the assignments are to well established haplogroups, based on SNPs. Some assignments are to hypothetical haplogroup branches, based on STRs. Such branches are proposed by many people, including Mayka and me. In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence. With less than 80% confidence, my assignment categories are tentative, not called types, usually called clusters.

About half of Polish men belong to haplogroup R1a. Most of my work has been on R1a. The R1a Project has lots of additional information.

This Abstract is for people reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you might prefer to read the Introduction first.

This web document has three purposes: 1. More detailed explanations for the sample assignments in the Polish Project. 2. Summary of my published results. 3. Update with recent results.

The topic is common Polish Y-DNA clades - identification of male line Y-DNA clades that are concentrated in the region of Historical Poland.

I use the word type to mean an STR cluster with statistical confidence as established by my Mountain Method. Many of my types have been validated by discovery of new SNPs that qualified the corresponding clades as official haplogroups. I expect more than 80% of my types to be validated some day, but my method is intended to be slightly aggressive, so I do not expect 90% validity. I chose the word “type” because it is not generally used in genetic genealogy and I wish to distinguish my types from haplogroups and from other clusters. All types have associated clusters but not all clusters qualify as types. In my publications and web pages I make it clear which types I have discovered in web data and which types were suggested to me by others, with references. Often when I discover a type I later find out someone else had mentioned it earlier on the web; let me know if you the reader have more clues and references for me.

Most types that I discuss seem to be 1,000 to 5,000 years old, so all the men in each type seem to be descended in direct male lines from one man (MRCA) who lived that long ago (TMRCA). A few of my types might be younger or older than that range.

I use phrases like “seem to be” over and over because the methods are statistical.

The Polish Project is considered representative of Historical Poland, with caveats explained in my Publication.

I am interested in Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal. This document is dedicated to identifying haplogroups and types and clusters concentrated in Poland, with detailed explanations. I am aware that some people object to the use of Y-DNA for historical analysis, so I try to mention caveats along with my comments.

About half of Polish men belong to haplogroup R1a. The R1a Project has lots of additional information about that haplogroup.

When I originally posted this web page in December 2007, no significant haplogroup subdivision of R1a was available, so this page started with hypothetical subdivisions of R1a. A major division, roughly 50-50, based on the SNP M458, became available in November 2009. Now, 2013, there are many haplogroup branches known in R1a, and this page continues with proposed further division of Y-DNA clades common in the region of Historical Poland.

Actually, the largest category in the Polish Project is the R1a-U category, for “Unassigned” samples without sufficient data for confident assignment to R1a branches. The Results Table is based upon the samples with sufficient data. If you are in this R1a-U category, you can promote yourself into one of the branches by purchasing the full 67 marker STR set, since all R1a samples with 67 markers get a detailed assignment.

There are two large categories in the Polish R1a data. Since 2007, I have been calling them P type and N type. P type is now known to be more than 98% equivalent to the haplogroup R1a1a1b1a1a (L260). N type is more than 95% equivalent to the paragroup R1a1a1b1a1 (M458+ L260-). P type is concentrated in Poland, and becomes rarer with increasing distance from Poland. N type seems to be mostly Slavic, widespread in eastern Europe.

Since 2007, I had been calling another major R1a Polish category K type. Over the years I had subdivided K into several smaller types and clusters, although I did not have high confidence that all of them in fact belonged to a single unique clade, as discussed at this web page over the years. My K group is now known to be a mix of independent haplogroups, so the Polish Project no longer uses K as a category, although quite a few small clusters with names such as Kx and Kz are still predicted. The various K categories are now clusters, types, and confirmed haplogroups within the two major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93). K type is being removed from this web page with updates.

Another large R1a clade, the one I call L type, is very rare in the Polish Project. It is common in Scandinavia, and now known as R1a1a1b1a3 (Z284).

Thanks go to Lawrence Mayka, Polish Project administrator, for extensive email information and assistance.

You can compare data to my types by clicking this link to instructions for Ysearch.

Reminder: I am concentrating on Poland. The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically in Polish data. In other countries far from Poland P type is rare. If you belong to an R1a cluster that is rare in Poland, I’m sorry, but I’m not covering you. Check out the R1a Project.

Recent updated graphical representations of the full R1a tree are available at the R1a Project and at Russian sites, for example Semargl.

This Introduction is for people unfamiliar with the jargon of genetic genealogy.

There are quite a few web sites with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JOGG are good general references. The Y Chromosome Wikipedia article is about male line DNA, also called Y-DNA.

The following several paragraphs are a brief introduction to genetic genealogy for Y-DNA, providing some definitions of jargon needed to read my web pages. The definition words are boldface. I often use links to those definitions when I use a jargon word for the first time in a topic. There are more boldface definitions in the summary of my Methods.

The Y chromosome gets passed from father to son, so it works just like a male family name. Men are divided into haplogroups based on known rare mutations (most of them are called single nucleotide polymorphisms, SNPs) in the Y chromosome. Division into haplogroups is done in a manner that has virtually 100% confidence. I say “virtually” because your confidence in your DNA result from your DNA testing company might be 98% or 99% or 99.9%; the confidence for haplogroups is better than that. In other words, we are more confident in the validity of the haplogroups than in the accuracy of individual DNA tests. We can be virtually certain that all the men in a haplogroup descend in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup. The MRCA corresponds to a node, or branching point, in the Y-DNA tree of male line ancestry. Time of the Most Recent Common Ancestor (TMRCA) is an estimate of how long ago he lived - the age of the node.

Lots of people, including me, are working to discover more SNPs on the Y chromosome so that the haplogroups can be divided further into smaller haplogroups.

Haplogroups have alphanumeric codes, like R1a1a. A paragroup is a haplogroup considered without its known haplogroup branches. An asterisk is often used in paragroup codes, like R1a1a*. When a new branch is discovered within a paragroup, it gets removed from the definition; that changes the meaning of that paragroup. The meaning of a paragroup varies at different web sites, depending upon which branches are used in the associated database.

Many people, including me, try to “stay ahead” of the haplogroups by analyzing other mutations that are not so rare (called STR) on the Y chromosome. Men submit their Y-DNA data to various web sites. There are lots of STR data available on the web. Men are divided into STR clusters as hypothetical subdivisions of the haplogroups, based on similarities of STR values. All such clusters are hypothetical. Some will be validated in the future by new SNP discoveries. There are various statistical methods for estimating the confidence of STR clusters. I recently published a method that I developed. That publication has references to other methods. There is a brief summary of my method below.

Y-DNA is biologically accurate, so some men discover that their Y-DNA does not match the DNA of their male line cousins identified by genealogy research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people prefer to avoid genetic genealogy.

The male line associated with the Y-chromosome is only one ancestral line. Anyone who tries to make a family tree going back 300 years has more than a thousand root tips to be filled by names of ancestors who lived back then; the one man at the tip of the male line root is only one of those thousand. That is another reason some genealogists avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of many. That said, many people enjoy the challenging hobby of figuring out to which ancient extended male line they belong.
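The “more than a thousand root tips” figure follows from doubling the number of ancestors at each generation. Here is a minimal sketch of that arithmetic; the 30 years per generation is an assumption made only for this illustration, and the function name is hypothetical.

```python
def ancestor_slots(years_back: float, years_per_generation: float = 30.0) -> int:
    """Number of ancestor positions in the generation that lived `years_back` years ago.

    Assumes a fixed generation length (30 years by default, an illustrative
    value) and ignores the same person filling more than one position.
    """
    generations = int(years_back // years_per_generation)
    # Each generation back doubles the number of positions in the tree.
    return 2 ** generations


print(ancestor_slots(300))  # 10 generations -> 1024 positions
```

Only one of those positions is the male line ancestor that Y-DNA traces.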

Most STR based clusters have an MRCA who lived thousands of years ago, before family names were common, so most men assigned to a typical cluster do not have the same family name.

Many SNP based haplogroups have an MRCA who lived more than ten thousand years ago, so these span multiple ethnic groups and nationalities. For example, the R1a haplogroup is of interest to me. R1a is most common in Slavic countries but calling R1a Slavic is misleading because it is found throughout Europe and west Asia. The MRCA lived so long ago that he may have spoken a language that we would not consider Slavic if we could hear it. It is possible that he did not even live in what is now the Slavic region of Europe; maybe his descendants moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and culture, by now some of his descendants long ago moved to other parts of Europe and Asia. One of the appeals of genetic genealogy is trying to figure out ethnic descent and migration from the statistics of haplogroups. Some people object, pointing out that ethnicity cannot be defined genetically because of all the moving and mixing of people over the millennia, and because the Y chromosome is only one of our 24 chromosomes. True enough. Some individuals and some web sites go too far with genetic genealogy claims based on DNA. That said, statistical analysis of haplogroup data provides many clues on human origins.

Again, some people try to stay ahead of haplogroups, using statistical analysis of STR based clusters to gain insight into more recent human origins. I am one of those people. My interest is Polish origins. This web document, however, is not for historical analysis and conclusions, except for occasional comments to remind us of the goal. This document is dedicated to Y-DNA data and analysis, both SNP and STR, identifying haplogroups, types, and clusters concentrated in Poland, with detailed explanations.

The bottom of my Method section has more definitions for a number of genetic genealogy terms.

There are many organizations and commercial companies on the web where you can order a cheek swab kit to mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company FTDNA; I mention them because I make extensive use of their data; check Google for competitors. At FTDNA, click on Products for cheek swab kits. DNA results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your data into one of the many databases; for example, most of my analysis is from the data in the FTDNA Polish Project.

I use the FTDNA standard set of 67 STR markers (plus a few non-standard ones occasionally). I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets. Other companies use standard marker sets that may not overlap with all the FTDNA markers.

Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch.

L1029 was a new SNP last March. L1029 provides a branch of M458, added to the ISOGG tree this year. The other branch is L260 (update next topic). L260 was discovered in 2010. Most M458+ L260- samples are coming out L1029+. I have been calling M458+ L260- samples N type (very few exceptions - next topic). It is now clear that L1029 is a major branch, capturing more than 90% of N type (more than 90% of M458+ L260-).
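The branching just described (M458 splitting into L260 and L1029, with a remainder that is negative for both) can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name and the label strings are hypothetical, and it looks at SNP results only, whereas assignment to N type or Np also depends on STR values.

```python
from typing import Optional


def classify_m458(m458: Optional[bool],
                  l260: Optional[bool],
                  l1029: Optional[bool]) -> str:
    """Return an illustrative label from SNP results.

    True means a positive result (+), False a negative result (-),
    and None means the SNP has not been tested.
    """
    if m458 is False:
        return "not M458"
    # L260 and L1029 are branches of M458, so a positive result
    # on either one places the sample inside M458.
    if l260 is True:
        return "L260+"
    if l1029 is True:
        return "L1029+"
    if m458 is True and l260 is False and l1029 is False:
        return "M458+ L260- L1029-"
    if m458 is True and l260 is False:
        return "M458+ L260- (L1029 test needed)"
    return "undetermined (more SNP tests needed)"


print(classify_m458(True, False, None))
```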

In the Polish Project, most of the N type L1029- results are samples with Poland given as the ancestral country. This spring, Mayka started classifying these as the “Np” cluster.

In this topic I present preliminary evidence that Np corresponds to a Y-DNA clade concentrated in Poland. I also explain why all Polish N type samples (tested or predicted M458 and not L260) would benefit from the L1029 test, because Np cannot be predicted precisely, and because there is a low fraction of L1029- outlier samples, not fitting Np.

So far (10 Oct data) there are 20 results L1029- (including a few samples that are not M458+) and 42 results L1029+. N type requires 67 or more of the standard markers for confident assignment. Using samples with those 67 markers the numbers are 114 N type, of which there are 12 L1029- and 41 L1029+. Of the 61 remaining N type samples (at 67 in the Polish Project) not tested for L1029, I estimate only about 5 might come out L1029-, because testing has been concentrated on STR predictions, discussed below in this topic.

One M458+ L260- L1029- sample is not counted as N type, as discussed in the next topic as Ry type. This seems to be a very small outlier clade with an old node in M458.

Two of the 12 differ significantly from the other 10, so I am predicting these two as outliers, with M458 nodes older than the main Np hypothetical clade.

Np Cluster Definition: I constructed an STR definition for the remaining 10 samples with similar STR values and L1029- result. The definition uses 37 of the 67 markers. The cutoff is 2 (steps less than 2 are considered matches). I uploaded this definition to Ysearch, code CHFXB. My analysis file is L1029Study.xls.

On this basis, 3 of the untested N type samples fit the definition and are predicted L1029- members of the hypothetical Np clade. Two more are marginal, so perhaps there are 14 Np samples among the 114 N type. N type is 8.8% of the Polish Project, so that means 14 / 114 * 8.8% = 1.1% Np samples in the Polish Project. The statistical uncertainty is wide, so my estimated 80% confidence range is 0.5% to 2%. Insofar as the Polish Project is representative of Historical Poland, it seems the Np hypothetical clade has roughly 1% frequency in the region of Historical Poland. Of the 10 confirmed Np samples, 8 provide “Poland” as origin, one “Russian Federation” and one “Lithuania”. The 3 predicted Np samples have two “Poland” and one “Belarus”. There is no need to subtract the samples without “Poland” because the Polish Project as a whole has a similar frequency of samples not “Poland”; such samples come from men with evidence of male ancestry from Historical Poland.
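The frequency estimate above is a simple scaling of the N type percentage by the Np share of N type. A minimal sketch of that arithmetic, using only the counts quoted in this topic (the function name is hypothetical):

```python
def subclade_percent(subclade_count: int, parent_count: int, parent_percent: float) -> float:
    """Scale the parent category's project percentage by the subclade's share of it."""
    return subclade_count / parent_count * parent_percent


# 14 estimated Np samples among 114 N type; N type is 8.8% of the Polish Project.
np_percent = subclade_percent(14, 114, 8.8)
print(f"{np_percent:.1f}%")  # 1.1%
```

This reproduces only the point estimate; the 0.5% to 2% range quoted above is a separate judgment about statistical uncertainty and is not derived here.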

DYS460 = 10 is a very strong signature marker for Np. All 13 of the confirmed and predicted Np samples have this value. Those two outlier samples also have this value. Among those 41 L1029+ samples, only 6 have this 10 value; 3 have 12 and the 32 others all have the N type modal 11 value. The statistics of this paragraph are misleading because DYS460=10 was used to encourage L1029 testing in the Polish Project. I would expect a few Np to show up in the future with 460 value other than 10 (mutated from the Np ancestral value), and I would expect in the long run a lower fraction (less than 6 / 32) L1029+ to have the 10 value (independent mutations). Among the 49 N type samples not confidently assigned to sub-categories, only 5 have the 10 value, and 1 of these is a marginal Np sample mentioned above.

CDYa = 33 is another good signature. These two markers alone with cutoff 1 (that means both markers match) capture 9 of the 13 Np samples (Np defined as 13 captured by 37 markers cutoff 2). These two markers also capture 2 marginal samples (at the step 2 cutoff of Np at 37), plus only one other N type, plus a few D type (D are not members of the M458 clade, but DYS460=10 is modal in D). CDY is a fast mutator, so it is unusual for it to serve as a signature marker. I ran into this on one other occasion, where I postulated a mutation disabled CDYb; see my discussion at http://www.gwozdz.org/L540.html#CDYb. Actually, another reasonable explanation is that this CDYa=33 signature is just luck, because using only 10 samples we should not be too surprised that one of the rapid mutators looks like a signature, by the luck of random mutations. Yet a third explanation: Np might really be 2 or more clades where the ancestors (MRCAs) of each clade had the CDYa=33 value by luck, but those ancestors differed at other markers; this explanation is discussed more below.

There are no more good Np signature markers. Np modal values differ from N modal values at only 4 of the 67 markers. There are only two Np samples at 111 markers, and they do not seem to differ from N at those additional 44 markers. On this basis, I am not confident that my definition is very precise, because it takes as little as 2 mutations in the male line history for a sample to be incorrectly predicted, using any STR definition.

There is another reason for my uncertainty about my 37 marker Np definition: I worked harder than usual to construct this definition, so there is selection bias. Markers that just happen to have no mutations in those 10 samples are all in the definition. Any marker got dropped if it produced 2 or more mutations in any sample of those 10. Surely as more samples show up I’ll need to modify my definition. Those 37 markers are only a “good bet” definition for Np prediction today.

I published my SBP method of quantifying confidence in clade predictions based on Y-DNA STRs. Lower SBP means higher confidence. I reserve the word type for clusters with SBP < 20%. I consider SBP meaningless for SBP > 50%. Np comes out with SBP = 64%. This does not necessarily mean that Np is invalid as a clade prediction. My SBP method gives larger values for SBP with few samples, so valid clades improve with more data (SBP becomes smaller). A clade with modal STR values close to the father clade (N is the father of Np) necessarily comes out with large SBP. Concentration in Poland is evidence of validity for Np. That 460=10 is also evidence of validity. In my estimation, Np has about 80% confidence of validity, all evidence considered, but only 50% confidence of being a unique clade. Np might be primarily one clade with interference from other independent small clades with similar STR values. Or, Np might be 2 or more clades, about the same size, all concentrated in Poland, but distantly related. Clarification: two clades with very close nodes to the father branch might be considered a single clade; here I mean that Np might be 2 clades with nodes that are not close in the tree, perhaps with other small clade nodes between them that do not fit Np STRs (by the luck of random mutations in the ancestor). More discussion below on this idea.

In the R1a Project, my 37 marker definition captures 11 samples with SBP = 95% (data at 67 markers, download 14 Oct). Eight of the 11 have L1029- result and the others are not tested yet. Seven of the 11 are of “Poland” origin. Two L1029- are N type that do not match Np. There are 38 L1029+ that do not match Np. Summary: L1029- are rarer in the R1a Project (compared to the Polish Project) and the L1029- predominantly match Np. SBP is worse (higher) because of interference at the cutoff by more R1a samples from outside Poland. This paragraph is not conclusive, however, because the administrators of both projects work together; many of the samples come from men who joined both projects. Both projects worked hard on getting L1029 results this year, using 460=10 fit as a guide for emphasis.

As an independent test, I checked (11 Oct) the “RussiaDNA” Project (another FTDNA project). Of 260 R1a total, only 12 have been tested for L1029, and only 2 of these 12 came out L1029-: one Poland and one Russian Federation. This is preliminary evidence that Np is rare in the Russian Federation, although N is common in all Slavic countries.

LituaniaPropria: 4 L1029 tests, two of them negative, both of “Lithuania” origin. One of the L1029- samples is also in the Polish Project, and both are also in the R1a Project. In addition, both L1029+ samples are also in the Polish Project, and one is in the R1a Project, so these are not independent data.

Other projects are not concentrating on L1029 tests. I hesitate to encourage them, because M458+ L1029- seem to be mostly from Poland.

I have an R1a database at 67 markers with 1,816 samples from 15 FTDNA projects. I collected this 20 June, when there were fewer L1029 results. My 37 marker definition captures 13 samples, but 12 of these are in the Polish Project, and the other is in the R1a Project. No additional samples fit Np. There are 10 more marginal samples at the cutoff step 2; only 2 of them are in the Polish Project, and only one is from Poland. This is my strongest evidence that the Np cluster is concentrated in Poland.

Ysearch: 9 samples are captured by my Np definition CHFXB. Only 2 are from Poland. Only 2 of the 13 Polish Project Np joined Ysearch (one Poland and one Lithuania). SBP is poor for Np at Ysearch because there are 6 samples at the step 2 cutoff, none from Poland. In addition, 2 “Central European” modals fall at step 2 (37 markers used), emphasizing how hard it is to separate Np. A simple explanation for these Ysearch results is that there are 1 or more other clades concentrated outside Poland, which might be L1029- or L1029+.

At the top of this topic, I reported “more than 90% of N type” (M458+ L260-) are L1029+. Since L1029- are concentrated in Poland, it may actually be more than 95% worldwide. However, there is a reasonable possibility of one or more small clades showing up L1029- from outside Poland when more samples are tested.

Age of Np: It is too soon to estimate the age (TMRCA) of L1029, and age based on STR variation is uncertain because of known caveats. However, L1029 is probably not much younger than N type because L1029 includes almost all of N type. N type is surely older than 2,000 years. Indeed, variation of L1029 STRs is looking similar to N type variation. The L1029- node is necessarily the same or older than the L1029 node, so Np has an old node. However, the age of the node is almost always older than the age of the clade (TMRCA). Np seems very young, as evidenced by the unique 460=10 value discussed above. On the other hand, other markers have significant variation within Np; that may mean Np is not so young; or, that may mean Np is composed of 2 or more clades, each of which is young.

Speculation: Np reminds me of P type (L260 update, next topic). In my 2009 publication, and at this web page, I have speculated that L260 may have a very old node, but the P type ancestor (MRCA) may have lived more recently, perhaps not long before formation of the tribes that led to the Polish nation. It seems to me that M458 is quite old, but not many M458 individuals survived over the millennia, and a few of the M458 survivors were lucky enough to found clades during the population expansion of the last 3 millennia. Perhaps the Np ancestor, with L1029- and 460=10, also lived long ago and left few survivors; most of those few formed what are today very small clades, and one was (or perhaps 2 or more, all with 460=10, were) lucky enough to found the medium sized cluster today apparent as Np. I find it interesting to consider the men who lived 1,000 to 2,000 years ago in the region that is now Poland (and / or maybe in another region from which there was a migration to Poland). Due to the statistics of Y-DNA inheritance, most men do not form clades that last long, and very few men form large clades. Human behavior may perhaps broaden the statistical spread of clade size, allowing rare men to produce relatively larger clades. I speculate that among those proto-Polish men who founded clades that survive today, most were R1a, and many of those were M458, and one or a few of those were Np, and one was P.

Comment 17 Oct 2012: This topic needs an update and rewrite. L1029 has been updated in the previous topic.

The new SNP L1029 includes most but not all of N type. A few M458+ samples have turned up that are neither N type nor P type. One L260+ sample has turned up that is clearly not P type.

For this update, I collected in June 2012 a database of 3,586 unique R1a samples, from 15 large FTDNA Projects, each with significant R1a data.

At this web page, I have been saying that L260 seems almost equivalent to P type, and that M458 seems almost equivalent to N type plus P type. I say “almost” because there have always been borderline samples. Recent data continues to confirm this general summary, as discussed in the following paragraphs:

Ry type: There is a family set (five samples with the same family name, very close STR match to each other) where one of them recently (Mar 2012) tested M458+ L260- L1029-. These five are clearly not N or P. Not even close in STRs. These 5 samples are now categorized in the Polish Project as “Ry type”. They were independently noticed by Lapinski, the administrator of the R1a Project, who also created a new category for them in that project.

L260+: In that 3,586 database, there are 79 with an L260+ result. 53 of these give Poland as “Origin” of ancestor. There are 6 from Germany, 4 each from Czech Republic and Russian Federation, 3 each from Hungary, Slovakia, Ukraine and Unknown. L260 is clearly concentrated in Poland. Many men (samples) join multiple projects; 67 of the 79 joined the Polish Project and 12 did not. 71 of the L260+ samples have the 67 marker standard set, 61 of them are in the Polish Project.
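As a quick check on the origin counts above, the following sketch totals them and computes the share reporting Poland. The counts are taken directly from the paragraph; the variable names are illustrative.

```python
# Origin of ancestor reported by the 79 L260+ samples in the 3,586 sample database.
l260_origins = {
    "Poland": 53,
    "Germany": 6,
    "Czech Republic": 4,
    "Russian Federation": 4,
    "Hungary": 3,
    "Slovakia": 3,
    "Ukraine": 3,
    "Unknown": 3,
}

total = sum(l260_origins.values())
poland_share = l260_origins["Poland"] / total

print(total)                  # 79
print(f"{poland_share:.1%}")  # 67.1%
```

The share is a property of these project databases, which depend on who chose to join and test, so it is not a population frequency.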

The following analysis uses all 1,816 samples that have the 67 marker standard set.

There is one sample recently tested L260+ that does not match P type. The P type cutoff is step 7 using the P43 definition (samples are predicted P type if mutation count is less than 7, from the P type modal haplotype, using 43 of the 67 standard STR markers). That one recent outlier is step 11, not even close to P. There are six other L260+ outliers: one at step 9, two at 8 and three at 7.

By comparison, those Ry samples vary from step 11 to 14 from P type. The N46 definition has cutoff 8. That P type step 11 outlier is at step 16 from N type, and the Ry samples vary steps 10 to 13.
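The step counts quoted for the P43 and N46 definitions can be illustrated with a minimal sketch. It assumes that a step is one repeat of difference from the modal value, summed over the markers in the definition; the actual counting rules (for example for multi-copy markers) may differ. The marker values and the three marker definition below are made up for illustration and are not the real P43 or N46 definitions.

```python
from typing import Dict, Iterable

Haplotype = Dict[str, int]  # marker name -> repeat count


def step_distance(sample: Haplotype, modal: Haplotype, markers: Iterable[str]) -> int:
    """Sum of absolute repeat differences over the markers in the definition."""
    return sum(abs(sample[m] - modal[m]) for m in markers)


def is_predicted_member(sample: Haplotype, modal: Haplotype,
                        markers: Iterable[str], cutoff: int) -> bool:
    """A sample is predicted to belong if its step count is below the cutoff."""
    return step_distance(sample, modal, markers) < cutoff


# Hypothetical toy data: a definition normally uses a subset of the 67 markers.
toy_modal = {"DYS393": 13, "DYS390": 25, "DYS460": 11, "CDYa": 34}
toy_sample = {"DYS393": 13, "DYS390": 24, "DYS460": 10, "CDYa": 31}
toy_markers = ["DYS393", "DYS390", "DYS460"]  # CDYa is left out of this toy definition

print(step_distance(toy_sample, toy_modal, toy_markers))           # 2
print(is_predicted_member(toy_sample, toy_modal, toy_markers, 7))  # True
```

Leaving a fast mutating marker out of the definition, as CDYa is left out here, keeps one noisy marker from pushing a sample over the cutoff.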

For now, we are categorizing those P type outliers as P type for convenience, although I suppose these samples are evidence (not proof) that the L260 SNP is somewhat older than P type (the hypothetical clade with strongly correlated STR values - see L260 and M458 for clarification of these words). I expect future updates of this web page will introduce new categories for some of these outliers.

P type has no foreign outliers. No samples predicted P type (step <7) have so far come out L260-.

There are 14 samples at the cutoff step 7, compared to 152 P type samples <7, for SBP = 13.6%. Of those 14, 5 have been L260 tested: 3 positive and 2 negative.

That’s for the large 1,816 database. The Polish Project, part of this large database at 67, includes 114 of those 152 P type, and 8 of those 14 at step 7, for SBP = 11.7%. The lower SBP is a demonstration that P type is concentrated in Poland. That 11.7% is an upper limit estimate of borderline P type samples at 67 markers, but P type is actually much more isolated than that in the Polish Project: Five of those 8 at step 7 are confidently assigned to other types and haplogroups. Of the remaining three, two have tested L260+ and the third remains classified as P Borderline (L260 test needed), along with a couple samples at step 6.

This is all evidence (not proof) that P type is likely younger than L260, and that there are probably at most only very few small branches (twigs) older than P type within the L260 haplogroup (see L260 and M458 for clarification).

The new L1029 haplogroup: Only N type samples are coming out L1029+, but many N type are coming out L1029-. L1029 is clearly splitting N type, including more than half but less than 90% of N type. Watch this topic for an update of the percent as more data comes in. The N type samples that are L1029- are categorized as “Np type” at the Polish Project; there are 14 of them so far (13 July update): 7 with M458+ L1029- test results, 5 with only an L1029- result and an STR match to N type, and 2 with neither SNP test that match the other Np samples closely in STR values. I need to add a new topic for Np to this web page.

In that 3,586 database there are only 14 samples tested L1029- (late June): one was that Ry sample discussed above and one was P type. The other 12 match N type: 9 were Np samples in the Polish Project (in June) and 3 were “L1029- N type” in the R1a project; 8 of the 12 are confirmed M458+.

Np seems to be concentrated in Poland, but it is too soon to be sure, because the Polish Project administrator has been active in encouraging testing of these. Of those 12 Np samples in that 3,586 database, 8 give Poland as origin, 2 Russia, 1 Lithuania, and 1 Germany.

N type has relatively more samples at or just beyond the cutoff, plus a few foreign outliers. Next, I need to write, here in this topic, a similar summary for N type.

I last updated the P type definition 16 Aug 2011. I last updated the N type definition 13 Sep 2011. Previous updates used only the Polish Project data. Next, I plan to update both definitions using the larger June database. Because of selection bias, definitions improve as more data accumulates. Watch this topic for updates.

If you are P type or N type you would likely come out positive in the SNP test for M458 (M458+). If you are P type you are likely L260+. N type is likely L260-. If you have not already tested you can pay the small fee for these SNP tests to confirm that you belong to the corresponding haplogroup.
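The expected SNP pattern for the two types can be written as a small lookup. This is only a sketch: the function name `expected_branch` and the returned labels are illustrative, not Polish Project category names, and `None` stands for a test that has not been taken.

```python
from typing import Optional


def expected_branch(m458: Optional[bool], l260: Optional[bool]) -> str:
    """Map M458/L260 results to the branch described in the text.

    True = positive, False = negative, None = not tested.
    The returned labels are illustrative only.
    """
    if l260 is True:
        # L260 is a branch of M458, so an L260+ result implies M458+.
        return "L260 haplogroup (expected for P type)"
    if m458 is True and l260 is False:
        return "M458+ L260- paragroup (expected for N type)"
    if m458 is True:
        return "M458 haplogroup, L260 test needed"
    if m458 is False:
        return "outside M458 (neither P nor N expected)"
    return "M458 test needed"


print(expected_branch(True, None))
```

A result outside the expected pattern (for example an M458+ sample that fits neither type by STR values) is still possible, so the lookup describes expectations, not certainties.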

If you are assigned to P borderline or to N borderline you would benefit more from the M458 and L260 tests, because that would provide for you a definite assignment within R1a.

The assignment rules are done with high probability, so if you are unassigned (category U) there is a low probability that you would test positive for M458, with probability that decreases with your step (genetic mutation distance) from P or N.
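A "step" can be pictured as a count of repeat differences from a modal haplotype. The sketch below assumes one common convention, the sum of absolute differences over the markers in a definition. The counting rules and marker selection actually used for P and N are in the analysis files, and the marker values here are invented for illustration.

```python
def step_distance(sample: dict, modal: dict) -> int:
    """Sum of absolute repeat differences over the markers in `modal`.

    Markers missing from the sample are skipped (an assumption of this sketch).
    """
    return sum(abs(sample[m] - v) for m, v in modal.items() if m in sample)


# Hypothetical values, not a real type definition.
modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
sample = {"DYS393": 13, "DYS390": 24, "DYS19": 17, "DYS391": 10}

print(step_distance(sample, modal))  # 2
```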

If you have fewer than the standard 67 STR markers it is generally better to purchase the remaining markers. That way, you are more likely to get an assignment, because the statistics for STRs improve with more markers. Nevertheless, if you are not many steps from P or N you might consider doing the M458 test even with fewer than 67 markers.

There is a slight chance that you might test positive for L260 or M458 even if you do not match P or N. The haplogroup corresponding to M458 is old enough that there may be small clades with STR markers very different than P or N. I have not seen one yet, but there is no way to estimate this probability. I hesitate to recommend the M458 SNP test for men whose samples are distant from both P and N in STR values. I admit you can just wait to see if anyone with STR values similar to yours matches an SNP, then test for that SNP. However, we all benefit when some men test for all the new SNPs within an established haplogroup, because that way we find out the size and rough age of the corresponding new haplogroup branches. FTDNA offers “deep clade” test packages to test for all possible haplogroup branches, but my understanding is that L260 and M458 are not yet included in the R1a deep clade test. You need to purchase them separately from the advanced markers menu. No doubt FTDNA will add them soon to the deep clade package.

Part I is my “mountains in haplospace” method for evidence that certain “types” of STR clusters correspond to clades.

Part II is the application of that method to Common Polish Clades. That article has a lot more detail than this web page, but that article was published in the Fall of 2009, so this web page serves as an update.

PolishCladesUpdate is my folder for updates of the Excel analysis files for those two articles.

This web page will continue as an introduction and summary, without as much jargon and detail as the articles and update folder.

11 Jan 2011 update: There is a lot of activity these days in the discovery of new SNPs for dividing R1a into branch haplogroups. You can follow the activity at the R page of the ISOGG Y-DNA tree, and also at the FTDNA Draft tree.

The new SNP named L365 includes what I have been calling G type, based on preliminary data. It is too early to say if other samples in addition to G type are positive for this new SNP.

The new SNP named M417 excludes what I have been calling C type, based on preliminary data. So far very few R1a samples are negative for this new SNP, but it is too early to estimate the rarity of M417-.

In early 2011 FTDNA released some new SNPs for commercial testing, including the following for R1a: L365, M417, L366, L291, and others. To order new SNP tests, go to your home page at FTDNA, on the left under “My Account” click on “Order Tests & Upgrades”, then click on “Go To Advanced Orders” and check “SNP”. Use your browser search to find the SNP of interest. If you wish to publish your results, join one of the projects (click on “Projects”) and the administrator will analyze your data.

There are other new experimental SNPs discussed on the web. I’m not trying to list everything here, just the ones that are of interest for discriminating new R1a haplogroup branches.

This topic is an example of the confusion of haplogroup names. It is technically out of date, but I’ll not update this “Confusion” topic because the 2010 discussion still serves as a good example. This confusion applies to all haplogroups, not just R1a.

25 Oct 2010 update: New SNPs cause confusion in the alphanumeric notation for the haplogroups and paragroups.

In my fall 2009 publication I used the notation that was well known at the time, where more than 95% of R1a was known to be paragroup R1a1. The R1a1 samples with one of four very rare SNPs that had been known for a few years were called haplogroups R1a1a through R1a1d. Ysearch still (25 Oct 2010) uses the notation described in this paragraph. FTDNA Projects still use this notation for automatic assignment of samples. Individual samples are not actually assigned to a paragroup because most have not been tested for all SNPs. Most R1a samples are listed as R1a1. Many samples are listed as just R1a but almost all of those would come out R1a1 if tested for the appropriate SNP (the well known M17 or M198, or one of the new ones that all seem to be equivalent). I mentioned in my publication that all Polish Project R1a were coming out R1a1. Since then only one sample (out of 1441 R1a total in the Polish Project) has come out M198-.

New SNPs were discovered equivalent to SRY10831.2, the original R1a SNP. Subsequently, rare samples were found positive for some of these new SNPs but negative for SRY10831.2. I’ll use L62 to represent these; there are others that seem to be equivalent. Those define two small paragroups, R1a(L62, SRY10831.2-) and R1a1(SRY10831.2, M198-). That previous R1a1 paragroup becomes R1a1a(M198). Accordingly, when Underhill announced the M458 SNP, he called that haplogroup R1a1a7. L260 was called R1a1a7b when first discovered. In spring 2010 I rewrote this entire web page using the notation described in this paragraph.

The recent new SNPs change the notation again. I shall not attempt to rewrite this entire web page. As I update topics, I’ll use the current notation. For clarity, I’ll add the defining SNP in parenthesis when I do updates.

For example, what I have been calling P type is equivalent to the haplogroup now called R1a1a1g2(L260). What I have been calling N type is equivalent to the paragroup R1a1a1g(M458, L260-).

The choice of which SNP to put in parenthesis is arbitrary for haplogroup notation. For example, R1a1a1(M17), R1a1a1(M198), and a few others, all seem to be equivalent. But any day now someone might announce a few samples that test negative for one of those SNPs and positive for all the others, which would define a new paragroup and force the renaming of all branches beyond that new node in the tree.

There is ambiguity in assignment of samples. For example, a sample that tests negative for M198 might be called R1a(M198-), but it is not clear if this sample belongs to the paragroup R1a(L62) or to the paragroup R1a1(SRY10831.2) if it has not been tested for the latter.

My types have an uncertainty similar to SNPs. For example, I said N type is equivalent to R1a1a1g(M458, L260-). Recently two samples showed up in the Polish Project that are M458, L260- but just beyond N type as defined by STR fit. We can think of these two as a new “paratype”, although I’ll not use that word. We classify these two in the Polish Project as “M458+R”, the Remainder in M458 excluding N type and P type. Actually, as I discuss in the N type topic, it is not statistically certain where to place the cutoff for N type, so you could argue that the M458+R category has more than two samples in the Polish Project.

L260 is an SNP that I published in the Fall 2010 issue of JOGG. It has been available as an SNP test since early April 2010 at FTDNA.

M458 was published by Underhill. It has been available as an SNP test since early November 2009 at FTDNA. L260 is a branch of M458.

A new SNP, L1029, has been available as an SNP test since March 2012, also a branch of M458.

Because of the confusion of recent SNP discoveries ISOGG now uses the haplogroup nomenclature R1a1a1g1 for M458 but FTDNA still uses R1a1a1g. Similarly, L260 is R1a1a1g1b at ISOGG but R1a1a1g2 at FTDNA. ISOGG has the new L1029 as R1a1a1c. The FTDNA draft tree has some more recent SNP discoveries listed in the R1a branches. The R1a Project home page has a nice recent diagram of the proven R1a SNPs.

Both P type and N type are code names published by me before these SNPs were discovered. At this web page, I have been saying that L260 seems almost equivalent to P type, and that M458 seems almost equivalent to N type plus P type. I say “almost” because there have always been borderline samples. In Feb 2012, a few samples turned up that are exceptions: M458+ samples that fit neither P nor N. Not even close with STR values.

Reminder: There is a logical distinction between an SNP haplogroup and an STR type: I use the word “type” for clusters of samples where I have 80% or higher confidence that the type corresponds to a unique clade. I use the word “borderline” for samples that seem to have 50% to 80% confidence of belonging to that clade. P type and N type are very well isolated in haplospace, with relatively few borderline samples. P type is particularly well isolated. The age of a type is the MRCA (ancestor at the hypothetical node for the type). A new SNP may be younger than a type, capturing only part of the type. A new SNP may be older than a type, capturing all of the type plus additional samples that are not predicted into the type. L1029 is a branch of N type, clearly younger than the type.

Even if a new SNP captures all the samples of a type, future samples may show up that are positive for the SNP but do not fit the type, not even Borderline. These might be members of older branches (branches with nodes older than the type), or they might be statistical outliers (members of branches within the type, where these particular samples have significantly more mutations than statistically expected due to luck). Also, there may be some samples that fit the type with STR values but do not test positive for that new SNP. These might be members of the oldest branches of the type, older than the new SNP, or they might be outliers from other clades with distant nodes.

MRCAs for N and P must have lived long after the node for these two branches in the Y-DNA tree, because the STR definitions for N and P are very different (compared to definitions of other haplogroups, with yet older nodes). I say “must have” because this is a statistical conclusion; it is possible but very unlikely that N and P have a node not much older than the two MRCAs, and that those two men had unusually divergent STR values due to the random luck of mutations.

Reminder: My definitions use selected STR markers for optimum statistical prediction of which samples belong to a type. Because of selection bias, the definitions change slightly (one or more markers added or subtracted) as more data accumulates. See L260 and M458 News for status.

Almost all of R1a divides into R1a1a1* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260). These correspond to my original predicted division.

R1a also has several known rare groups: R1a*, R1a1*, R1a1aN, where N = 1 to 6 and 8. There is also a very rare R1a1a7a. That asterisk is used for paragroups; R1a1a* means haplogroup R1a1a without any of those 8 known branches.

The rare R1a groups are not in my R1a Table. It’s a shame the corresponding STRs are generally not published in SNP announcements. I don’t know if the rare groups all together add up to 0.1% or 1% of R1a. Surely they are less than 3%. My percentage calculations in my R1a Table do not need adjustment because any Ysearch samples that might belong to these rare clades would probably have unusual STR values, not falling into one of my types, but still be counted in the totals. In my R1a Table, rare samples are included in row R. That row R might have a few percent from these rare groups, but I don’t know exactly how many.

Underhill mentions 7 samples (men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.

Lawrence Mayka, the administrator of the Polish Project, had been assuring me by email that all the Polish Project member tests within R1a had been coming out negative for all the rare SNP subgroups. So if you are a Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the men from Poland. About half of these - about 1/4 of men from Poland - are R1a1a7. These two “about” estimates are approximate; my data on these SNPs are not random samples, so my population estimates are derived from the types in my table, which are STR based.

On 17 June Mayka informed me of the first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish Project. My table does not show this single exception because the table is for samples with 67 markers, which that one exception does not have. On 19 June Mayka informed me of evidence that C type might define a new rare subdivision of R1a slightly older than R1a1a; if this turns out correct it will be less than 1% of R1a.

An article was published online, 4 Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.

I call this article “Underhill” for short, because his is the lead name in the list of 34 authors for this major work.

This web page about Polish Clades was completely rewritten using this new information. Recent L260 and M458 test results are consistent with (albeit not full proof of) my previous R1a subdivision into “types” here on this web page about Polish Clades.

Briefly, most of R1a1a is split by this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-). See R1a Subdivision for a brief summary of other groups, and for a clarification of what R1a1a* means.

R1a1a7 is the new M458 haplogroup. R1a1a7 includes what I have been calling P type and N type here on this web page, even before M458 was available.

R1a1a* is a new paragroup. This is M458 negative. It includes all my other types, particularly K type.

The 70% confidence interval for R1a1a7 is about 50% to 60% in the Underhill Poland data.

Worldwide 77% of the Underhill data is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
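The 77% figure follows directly from the two counts quoted above. A minimal check, with variable names chosen here for illustration:

```python
r1a1a_star = 222  # M458- samples quoted from Underhill Table 7
r1a1a7 = 68       # M458+ samples

total = r1a1a_star + r1a1a7
share_star = 100 * r1a1a_star / total

print(total, round(share_star))  # 290 77
```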

M458 Results are coming in now for this new SNP test and the Polish Project R1a is splitting about evenly, with a few percent more R1a1a7 than R1a1a*, although the latter is more common worldwide.

Up to here, I have tried to write this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity with genetic genealogy jargon has understood. If you read this top to bottom, it gets progressively more detailed, with more and more jargon. I’m sorry about that, but the audience is also readers with genetic genealogy experience who want to know how I came to my conclusions. If you cannot follow some of this, it is written in a manner that you can jump around and pick out what you do understand, then come back after you have read more about genetic genealogy.

If you open this html document with Word, all the link targets (bookmarks) can be viewed alphabetically or by location.

Lawrence Mayka is the administrator of the Polish Project. Click on the Polish Project web link to see how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA and for Y-DNA. This web document of mine is restricted to Y-DNA, with emphasis on R1a. I help Larry with assignments to types and clusters.

Haplogroups are defined by SNP mutations. STR mutations are easier to test, so many samples have STR data without SNP data. Predicted assignments are based on STR correlations when SNP data is not sufficient, allowing assignment to smaller branches in the Y-DNA tree. STR based predicted assignments can be verified with the corresponding SNP test.

I mentioned above that haplogroup assignments based on SNP tests have virtually 100% confidence.

The Polish Project includes FTDNA assignments, which use a color code; green text means assignment based on an SNP test; red text means assignment based on STR prediction. I do not know the exact FTDNA computer algorithm for those red STR based predictions, but it is conservative; I notice they have more than 95% probability - less than 5% of those end up in different haplogroups when they are eventually SNP tested.

The Polish Project (Mayka) also makes haplogroup assignments based on tests for new SNPs that are not yet included in the FTDNA computer assignment algorithm. (FTDNA periodically updates the algorithm when there are many new SNPs available.)

The Polish Project provides assignments to terminal haplogroups using minimum 80% estimated probability based on STRs. The men with such assignments can verify their assignments by ordering the corresponding SNP test. Usually, these assignments are only for samples with 67 or more STR markers, because confidence is lower at fewer than 67. There are some exceptions, haplogroups where assignment is possible with fewer markers.

In addition, the Polish Project provides Borderline categories for samples with estimated probability 50% to 79%. Again, men can verify such assignments by ordering the corresponding SNP test. It is possible that samples coming out negative for the corresponding SNP might be assigned to another borderline category, for another “best estimate recommended” SNP. Again, borderline assignments are usually for samples with 67 or more STR markers.
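The two probability bands can be summarized in a few lines. This is a sketch of the thresholds as stated here, not the procedure used to estimate the probability itself; the function name and the label for the lowest band are assumptions.

```python
def confidence_label(probability: float) -> str:
    """Label an estimated STR-based probability (0.0 to 1.0) for one category."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= 0.80:
        return "assigned"       # 80% or higher
    if probability >= 0.50:
        return "Borderline"     # 50% up to 80%
    return "no assignment"      # hypothetical label for the remaining band


for p in (0.95, 0.79, 0.40):
    print(p, confidence_label(p))
```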

In addition, the Polish Project assigns samples to hypothetical future haplogroups, which are proposed subdivisions of known terminal haplogroups. These may be verified in the future by newly discovered SNPs, but not yet. In the Polish Project, such hypothetical haplogroups are called types if confidence is 80% or greater, as evidenced by SBP 20% or lower. Those with less confidence, SBP higher than 20%, are called clusters. So the name of the category provides an estimate of our confidence. Types with high confidence sometimes have marginal fits (samples that match with less than 80% net probability), so borderline categories are used to communicate the uncertainty. Again, such assignments are usually for samples with 67 or more STR markers.
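The naming rule for hypothetical haplogroups depends only on SBP. The sketch below encodes the 20% threshold and applies it to two SBP values quoted elsewhere on this page (5.3% for D type and 14% for E type); how SBP itself is computed is not shown, and the function name is an assumption.

```python
def hypothetical_branch_name(sbp_percent: float) -> str:
    """Return 'type' when SBP is 20% or lower, otherwise 'cluster'."""
    if sbp_percent < 0:
        raise ValueError("SBP cannot be negative")
    return "type" if sbp_percent <= 20.0 else "cluster"


print(hypothetical_branch_name(5.3))   # type
print(hypothetical_branch_name(14.0))  # type
print(hypothetical_branch_name(25.0))  # cluster
```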

Sometimes we use a Remainder category for paragroups, which means the remaining samples from a haplogroup that have not been assigned to subdivision branch categories. Remainder categories are usually not possible because a large number of negative SNP results would be required for a haplogroup with many known branches; instead samples are assigned to a best estimate borderline branch category.

The overall average confidence of assignments in the Polish Project is much better than 80%, because most samples fit their assignment with 90% to 99.9% probability.

I have a separate topic explaining the U Categories, for “Unassigned”, for samples without enough data for meaningful assignment. In the Polish Project, we do not use the U category for samples with 67 or more STR markers - if they cannot receive an assignment with 80% confidence we give them a Borderline, or Cluster, or Remainder assignment. Most of the samples in the U categories have only 12 STR markers. Of course, U samples (all samples) still have their FTDNA assignment to a main branch.
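The role of the marker count in this rule can be sketched as follows. The function name and returned strings are illustrative; the choice among Borderline, Cluster, and Remainder depends on details not modeled here.

```python
def project_category(n_str_markers: int, best_probability: float) -> str:
    """Decide between a confident assignment, a fallback category, and U.

    best_probability is the estimated probability (0.0 to 1.0) of the
    best-fitting category for the sample.
    """
    if best_probability >= 0.80:
        # Confident assignments are possible even with fewer than 67 markers.
        return "assigned"
    if n_str_markers >= 67:
        # With 67 or more markers the U category is not used.
        return "Borderline, Cluster, or Remainder"
    return "U"


print(project_category(12, 0.30))  # U
print(project_category(67, 0.60))  # Borderline, Cluster, or Remainder
```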

Update 15 Feb 2012. The Results Table is based on data from the Polish Project. The data was downloaded on 7 Feb 2012, at which time there were 1903 Y-DNA samples (men). 1071 of these have data from 67 or more STR markers. Data was edited for family sets, 57 samples, as explained in my publication. Net 1846 samples.

Polish Project Assignments are taken as representative of Poland, with caveats explained in my publication.

I did the editing and tabulation in an Excel file, which is available: ResultsTable.xls

For haplogroups I1, R1a, and R1b, assignments to clusters and types are made using 67 markers. Samples with fewer markers cannot be assigned with confidence. For this reason, the Results Table uses only the data at 67 markers for these three haplogroups. The totals are indicated in the Results Table. That Excel file has analysis sheets for each of these three. As indicated in those sheets, some Unassigned samples with 67 markers but insufficient SNP data were assigned to SNP haplogroups based on my estimates of how many of the Unassigned samples would fall into various haplogroups if SNP tested.

Column Haplogroup or Type has labels determined by Mayka. Most of these are branch haplogroup (or paragroup) codes, with the defining SNP in parenthesis. Some of these are types as defined by me. A few of these are clusters. A few of these categories are for borderline samples, or for unassigned samples as explained in the corresponding sections of this web page.

Column Short Code is my own code for use in this web page. Some of these have links to jump to a description of that particular clade. Some have a Ysearch link in the far right column. Most do not have links because I have not found the time to work on them; my priority is clades that seem to be concentrated in Poland.

The Num and % columns are the number of samples for each category, and percent of the total. The number of samples mentioned in those detailed descriptions (below) may not correspond to the numbers in the table because the particular description updates may have been done at different times than the table update. The description section has descriptions of some experimental subtypes that are not listed in the Results Table.

ISOGG names change often due to new SNP discoveries. See R1a Confusion for examples.

Those types and subtypes and clusters are my own code letters, for brevity. Please do not confuse these code letters with official haplogroups. I have been using such code letters for R1a assignments in the Polish Project since 2007. Because half of Polish samples are R1a, I do not use “R” for R1a codes; all other short codes start with the haplogroup letter.

My Update Folder has an Excel analysis file for most types, plus many more files.

The Ysearch links provide the modal haplotype definitions, using a selected subset of the standard FTDNA set of 67 markers. I entered these data into Ysearch for our convenience. All my modal haplotype definitions are available in the Excel file Haplotypes.xls, which also has experimental types not mentioned here. Below are Ysearch instructions for quickly comparing your haplotype to many of my types at once.

Column % provides a good estimate of the frequencies in Historical Poland, insofar as the Polish Project is representative of Historical Poland, as discussed in my publication.

With just under 2000 samples, each sample represents just over 0.05%. The Results Table rounds to the nearest 0.1%, so one sample or two samples both get rounded to 0.1%. The statistical uncertainty is very high for clades with few samples. Most worldwide haplogroups are not present in the Polish Project, but it is statistically very likely that many haplogroups present in the Polish population at 0.5% are not represented in the Polish Project just by the luck of sampling statistics. At 67 markers, with just over 1000 samples, each sample represents just under 0.1%.
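The per-sample percentages can be checked from the counts given in the 15 Feb 2012 update. In this sketch the 67-marker count of 1071 is the figure quoted before family-set editing, so the second result is approximate; the function name is an assumption.

```python
def percent_per_sample(n_samples: int) -> float:
    """Percentage of the total represented by a single sample."""
    return 100.0 / n_samples


net_samples = 1903 - 57   # downloaded samples minus family-set edits
with_67_markers = 1071    # quoted count, before family-set editing

print(net_samples)                                      # 1846
print(round(percent_per_sample(net_samples), 3))        # 0.054
print(round(percent_per_sample(with_67_markers), 3))    # 0.093
```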

Updated 9 Aug 2013. “Unassigned”, or short code name “U”, is not a cluster, but a holding category for samples with insufficient data, so assignment cannot be made with high confidence in the Polish Project. U is a subcategory for multiple main branch haplogroups, for samples that obviously belong to that main branch but lack the data that, if obtained, should help assign them to a sub branch. For example, there are U categories associated with I1, I2b1, R1a1a1, and R1b1a2.

FTDNA makes assignments using as few as 12 STR markers, and even with no SNP results, but these are main branch haplogroups, not used as assignment categories in the Polish Project. The Polish Project aims for smaller sub branches. The FTDNA main branch assignments are visible at the Polish Project page.

R1a1a1-U is the largest category in the Polish Project, with 11% of the samples. All R1a1a1 samples with all 67 standard STR markers are assigned to the best fit category, not to “Unassigned”, although that may not be possible for all samples in the future.

Many samples with incomplete data are assigned to a category with 80% or greater confidence, but this is not generally possible. For example, some categories have rare STR signatures, so 37 or fewer STRs may be sufficient for some samples. If a sample has close STR matches (often obvious relatives with the same ancestor named in the data) all such samples are assigned to the same category. In other words, your assignment to a category other than U means the Polish Project judges no further data to be required at this time. However, confidence may often be increased beyond 80% probability by purchasing more STR markers and / or recommended SNP tests.

A standard 111 STR set has been available for some time, and many samples in the Polish Project have all 111. In the future, but not yet, 111 markers may be required for assignment to some categories.

There are separate topics below for descriptions of selected categories in Haplogroups I, N, and R1b.

Comment added 20 Oct 2012: This is a long topic with many short subsections, each for a category. Many of these subsections are out of date and need to be rewritten. The subsections without a date on the first line may be a few years old.

This large topic has descriptions for many of the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters. Types and clusters are high confidence hypothetical haplogroups. Borderline categories are lower confidence. There is also the Unassigned category for uncertain samples.

Click the Ysearch web links in the Results Table for modal haplotypes, which are my best fits of web data to groups of men with similar STR data.

Please don’t get confused. The following capital letter names are my codes for R1a categories. Capital letters are also used for the large official haplogroups, but that’s different.

Some of the following categories are discussed in my November 2009 publication, and may have archive copies of my 2009 Excel analysis files stored in the Supplementary folder. Many of the following types have my update Excel analysis at PolishCladesUpdate.

A type is a hypothetical clade of L342, which is a branch of Z93. A type does not correspond to a haplogroup yet, because there are L342+ samples that do not match the A type definition.

This type is discussed in my publication, Part II. The definition, using 67 markers, has been available since 2008 at Ysearch, as FCUFG.

I have consistently expressed more than 98% confidence that A is a valid clade, not just because of my work, but because the modal haplotype closely matches the various versions of the most common Ashkenazi haplotype, which has been widely studied and reported on the web. It should be emphasized that not all Ashkenazi match this type, and some men in this type may not be descended from Ashkenazi. This type is not restricted to Poland. Levy-Coffman wrote an article about Ashkenazi genetic genealogy; I noticed discussion in a recent Science article. I expect an SNP to show up someday equivalent to what I have been calling A type.

Between 2008 and 2011 I predicted that A type was a subtype of K type, but I never had more than 80% confidence in that prediction, which is now seen to be wrong, because K type is in Z283, a brother SNP to Z93. See the R1a Project for a recent SNP tree. The match of A type to K type at the first standard set of 12 markers is now seen to be a coincidence. Older publications call that 12 marker haplotype, very common in Eastern Europe, the “Ashkenazi” haplotype, but we now know that only a small fraction of men who match at 12 markers are Ashkenazim.

B. Update 8 Mar 2012. A hypothetical subtype of K type, identified by Mayka. Concentrated in Poland. I have more than 90% confidence that B type represents a clade that will be verified some day with a new SNP discovery. My confidence is only about 80% that it is a subtype of K; the node for B type in the R1a tree might be slightly younger or slightly older than the K definition node. Individual assignments to B type have 80% or higher confidence, depending on how closely each fits.

C. Update 10 Mar 2012. This type code name was dropped from the Polish Project in early 2011. The two C type samples are both now listed as R1a1a (M198+,M417-), and they are the only samples in this paragroup, so that is a better label. These are the only two R samples in the Polish Project with the signature (385a,455) = (13,10). C was added to the Polish Project in Dec 2009 by Mayka, who noted that Didier Vernade first pointed out the unusual DYS392=13 value in 2007. DYS392=11 is almost universal in R1a. C type is very small. There are only 2 Polish Project samples in C type, only 1 at 67 markers, but this type is well isolated on Ysearch, with 4 different samples with 67 markers. I calculated SBP = 7% using only 37 markers with Ysearch data (in early 2010). None on Ysearch are identified as “Poland”. C type differs very much in STR values from the rest of R1a1. That is because C type has a very old node in the R1a tree.

This type was added to the Polish Project in Jan 2010. The cluster was brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some time ago, based on DYS462=12.

Signature (460,481,462,560) = (10,<22,12,18). Any one of these four markers by itself can distinguish D type with high probability from other R1a1a1i (Z280) samples, but those values can be found individually as independent mutations in other R1a clades. D type cannot be distinguished using the 25 FTDNA standard markers. At 37 markers, only 460 is available.

At 67 markers, 481<22 is an effective signature: 16 total D type: 13 D have 481=21, and only one other R1a sample has the 21 value. 2 D have <21, with no other R1a samples. One D has the 22 value along with several other R1a. 481=25 is modal for R1a.
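The D type signature (460,481,462,560) = (10,<22,12,18) can be checked marker by marker. The sketch below only reports which signature markers a sample matches; it is not the full D type definition, which uses many more markers, and the sample values are hypothetical.

```python
import operator

# Signature from the text: (460,481,462,560) = (10,<22,12,18)
D_SIGNATURE = {
    "DYS460": (operator.eq, 10),
    "DYS481": (operator.lt, 22),
    "DYS462": (operator.eq, 12),
    "DYS560": (operator.eq, 18),
}


def signature_matches(haplotype: dict) -> dict:
    """Return {marker: True/False} for each signature marker the sample has tested."""
    return {
        marker: test(haplotype[marker], reference)
        for marker, (test, reference) in D_SIGNATURE.items()
        if marker in haplotype
    }


# Hypothetical 67-marker sample: 462 and 560 are not in that set.
sample = {"DYS460": 10, "DYS481": 21}
print(signature_matches(sample))  # {'DYS460': True, 'DYS481': True}
```

Because each of these values can also arise independently in other R1a clades, a match on one marker is only a hint; the assignment itself rests on the full definition.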

DYS462 is a standard STR marker at Sorenson, and has been available for years at Ysearch; 462 is now available at FTDNA with the 111 marker set. In Nov 2011 I noticed that DYS560=18 is another marker for D type from the 111 set, but that is not available at Ysearch (Nov 2011).

That DType.xls analysis file provides SBP = 5.3%, although I did manual editing of the definition to improve SBP, providing some selection bias. On the other hand, isolation of D type is even better than indicated by SBP for two reasons: Samples just beyond D type, steps 12 and 13, all have solid assignments to other types. Most of the D samples have 462=12 and a few have 560=18, and those samples beyond step 11 with data have other values at those 2 markers, so a future definition using all 111 markers should provide even better (lower) SBP. Only 3 D type have 111 markers; most of the DYS462 data was obtained some time ago by purchasing that marker separately.

D type seems to be Z280+ Z92-, based on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file). Z92 is a new SNP, so not much data is available; confirmation should be available soon. D is a subtype of what I had been calling K type; I’m now using K as a code for the paragroup defined by Z280*.

D type is clearly a Polish type: In the Polish Project 10 of the 16 D type at 67 markers indicate “Poland” ancestry; the exceptions are 2 “Unknown” (one with an obvious Polish name and one with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.

On Ysearch, there are 32 samples below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which is quite high for Ysearch. SBP is 15% on Ysearch, implying there are clades near the cutoff that are rare in Poland; indeed none of the 5 samples in the gap at steps 9 and 10 indicate Poland. For details see the “Ysearch” sheet in DType.xls.

Age (ASD sheet cell N12) comes out 1,385 years using all 67 markers. Old human Y-DNA clades have age older than the raw ASD calculation because of population bottlenecks and because of other statistical adjustments. However, D type is not very old, so this correction may not be needed. On the far right of that ASD sheet I sorted markers by age, and I added notes about problem values, and suggested four markers that should be masked out, but the age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216 years. I see evidence of subclades, so D type might be composed of younger subclades that might be identified with more data.
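
For readers who want to see the shape of a raw ASD age calculation, here is a minimal sketch in Python. It is not the spreadsheet's formula. The modal value is assumed to be the founder value, the per-marker mutation rates are supplied by the caller, and 30 years per generation is an assumed default. All names and the toy data are hypothetical.

```python
from statistics import mode

def asd_age_years(samples, rates, years_per_generation=30.0, masked=()):
    """Raw ASD age estimate, with no bottleneck adjustment.

    samples: list of dicts, marker name -> repeat count.
    rates:   dict, marker name -> assumed mutation rate per generation.
    masked:  markers to leave out of the calculation.
    """
    markers = [m for m in rates if m not in masked]
    total_asd = 0.0
    total_rate = 0.0
    for marker in markers:
        values = [s[marker] for s in samples]
        founder = mode(values)  # assumption: modal value is ancestral
        # Average squared distance from the founder value for this marker.
        total_asd += sum((v - founder) ** 2 for v in values) / len(values)
        total_rate += rates[marker]
    generations = total_asd / total_rate
    return generations * years_per_generation

# Toy data, made up for illustration only.
toy_samples = [
    {"A": 14, "B": 9},
    {"A": 14, "B": 9},
    {"A": 15, "B": 9},
    {"A": 14, "B": 9},
]
toy_rates = {"A": 0.005, "B": 0.001}
print(asd_age_years(toy_samples, toy_rates))
```

Passing marker names in `masked` mimics the comparison above between the all-marker age and the age with problem markers masked out.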

I noted three markers (on the far right of the ASD sheet) that I consider hints for subclades. Last year in this topic I mentioned Da, with the signature (458,576,444)= (16,20,14) and that still looks promising, but not convincing. One of the three D samples with 111 markers fits Da, and provides a hint that markers 463 and 715 from the 111 extension might help to resolve Da, so it will be interesting to see what happens as more D men order the 111 extension.

E. Update 8 Mar 2012. V. Rudich entered a modal for this cluster into Ysearch as ID MW7DP, named “North Eurasian”. Mayka modified it slightly for the modal used here by me, GNYBG, named “Belarus”. 67 markers. It’s an excellent type; on 25 May 2010 it had 16 samples at 67 markers in the Polish Project, with SBP = 14%. In late 2011 E type samples tested positive for the new Z92 SNP, corresponding to the R1a1a1g2d haplogroup (ISOGG early 2012). However, not all Z92+ samples fall into types.

FH Clade. F and H types were suggested by Mayka. They have the signature (439,511,452) = (11,11,28). They differ from each other, so I could not make a combined FH type. I can make a reasonable FH cluster, but it is not necessary, since the FH clade can be better defined as the combination of the three types Fa, Fb, and H. The original F type (introduced Jun 2010) was split into Fa and Fb in Dec 2010. DYS452 is not one of the FTDNA standard markers, so not many Polish Project members have this marker evaluated. Mayka and I helped most of the Polish Project members in FH, and members just beyond FH, to get 452 evaluated. Samples beyond FH have 452=30. My analysis files do not use 452 for determination of SBP. 452 would not significantly lower SBP because most of the background near the cutoff for each type are samples from the other two. In other words, Fa, Fb, and H are very well isolated from the rest of R1a, but not so well isolated from each other. These three FH types do not seem to be specifically concentrated in Poland (per Ysearch) although they are concentrated in Slavic countries including Poland. All three types seem quite young, with relatively low STR variance (see the ASD sheets in the analysis files).

FH Borderline. The borderline samples from Fa, Fb, and H are combined into a single FH Borderline category in the Polish Project, because these clearly belong to the FH clade but have less than 80% probability of belonging to any one of the 3 types.

Fa. Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%. See FH clade, above.

Fb. Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%. These samples were the original F type, before Fa was split off. See FH clade, above.

H. Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.

G. This type was suggested to me by Mayka, who calls it the Pomeranian cluster. Pomerania is the name of the region on the south shore of the Baltic Sea including regions of both Germany and Poland. Marcin Wozniak found the G modal haplotype (at 12 markers) to be very common among Kashubians. Kashubians consider themselves an ethnic group or nationality within Poland. It will be interesting to determine if Kashubians in Poland have a higher % concentration of G type than German Pomeranians. Meanwhile, “Pomeranian” is a convenient neutral name, suggests Mayka.

G type is mentioned only briefly in my publication because not much data was available to me at that time. My GType.xls update analysis file with June 2010 data has excellent results: There are 12 samples in a nice type with SBP = 11.2%. There is preliminary evidence of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in Ysearch; see Haplotypes.xls for a list including hypothetical working modals.

11 Jan 2011 news: Mayka informs me that one of the new SNPs, L365, is positive for all of 5 G type samples that were tested so far. A few samples from other types all tested negative for L365. It seems like G type is included in the new haplogroup defined by L365. One of those 5 is in that tentative Ga subtype.

Of course, this is very preliminary. It is possible, if unlikely, that some of the G type samples still might turn out negative for L365. It is quite possible other samples not matching G type might be found L365 positive. I’ll provide updates here.

Those 5 samples are positive for M417, negative for M458, and negative for a few other new SNPs.

L365 is one of a few new SNPs that look like they will receive the notation R1a1a1x, where x = i, j, k, etc.

14 May 2011 comment: Sorry I have not taken the time to update this G type topic. Recent data continues to verify that G type seems the same as the haplogroup defined by L365, now called R1a1a1i.

I. Minor edits 5 Aug 2011. Complete rewrite 4 Aug 2011. Based on 2 Aug 2011 Polish Project data. Three analysis files: IType.xls; IaType.xls; IbType.xls.

On Ysearch, I type is concentrated in Poland and in other Eastern European countries.

On 28 Jun 2011 Lukasz Lapinski suggested two small clusters based on recent I Borderline samples. These are currently called Ia and Ib types in the Polish Project. Ia and Ib are probably not really subtypes of I, as discussed in the following paragraphs.

I type seems to have structure. Some of the 67 STR markers are bimodal, which hints at subtypes. The bimodal markers are not correlated with each other, so I have not been able to identify subtypes with confidence.
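
One simple way to flag bimodal markers is sketched below in Python. The 25% threshold for the second most common value is an arbitrary assumption, not a rule from my analysis files. The function name and data layout are hypothetical.

```python
from collections import Counter

def bimodal_markers(samples, markers, min_share=0.25):
    """Return markers whose second most common value is carried by at
    least min_share of the samples (the threshold is an assumption)."""
    flagged = []
    for marker in markers:
        top_two = Counter(s[marker] for s in samples).most_common(2)
        if len(top_two) == 2 and top_two[1][1] / len(samples) >= min_share:
            flagged.append(marker)
    return flagged

# Toy data, made up for illustration only.
toy = [
    {"A": 14, "B": 9},
    {"A": 14, "B": 9},
    {"A": 15, "B": 9},
    {"A": 15, "B": 9},
]
print(bimodal_markers(toy, ["A", "B"]))  # ['A']
```

Flagging markers one at a time like this says nothing about whether they are correlated with each other, which is what would be needed to identify subtypes.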

My published 2009 definition for I type, I59, uses 59 of the 67 STR markers, cutoff 8. That definition still works quite well, with SBP 17.8% (Aug 2011). I consider SBP <20% sufficient to use the term type. I found a better definition, I62, cutoff 9, SBP 12.3%. The two definitions are compared in the file IType.xls. That 2009 definition had 22.4% SBP in 2009, so it did not quite qualify as a type back then. (Background means foreign samples with matching STRs that do not belong to the hypothetical I type clade; SBP is a high confidence statistical limit estimate.) Six of the 24 using that old definition are excluded by the new definition; if the latter is exactly valid that means background was actually 25%, which is close. The new SBP with the old definition is 17.8%, which is lower than 25%, but I’m comfortable with this because most of my published SBP’s have been shown to be larger than subsequent new data, as intended. The new definition also captures two samples that were previously borderline, one of which was classified I type anyway because that sample has close matches in I type. The new definition captures an A type sample; that sample is a good fit to A type; this false call is not incompatible with the 12.3% SBP which predicts less than 3 samples background (12.3% of 20). More about A type in a paragraph below.
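
A definition of this kind (a modal, a marker set, and a cutoff) can be sketched in Python as follows. The step metric here, a sum of absolute repeat differences, is an assumption for illustration; see my publication for the actual method. The rule that a sample is captured when its step is below the cutoff follows the usage on this page, where the last step of a type is one less than the cutoff. The marker values are toy data.

```python
def step(sample, modal, markers):
    """Genetic distance from the modal over the markers in a definition.
    Assumption: sum of absolute repeat differences, one step per repeat."""
    return sum(abs(sample[m] - modal[m]) for m in markers)

def in_type(sample, modal, markers, cutoff):
    """A sample is captured by the definition when its step is below the cutoff."""
    return step(sample, modal, markers) < cutoff

# Toy three marker definition with cutoff 2, made up for illustration.
markers = ["DYS578", "DYS458", "DYS393"]
modal = {"DYS578": 9, "DYS458": 14, "DYS393": 13}
near = {"DYS578": 9, "DYS458": 15, "DYS393": 13}  # step 1
far = {"DYS578": 8, "DYS458": 16, "DYS393": 13}   # step 3
print(in_type(near, modal, markers, cutoff=2), in_type(far, modal, markers, cutoff=2))  # True False
```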

The new I type definition lacks breadth - changing the number of markers increases SBP. This is displayed in IType.xls as columns for different marker sets. For such analysis, the database needs to be restricted to the samples with step not too far beyond the cutoff. For I type the ranking of markers is sensitive to exactly where the database is truncated, so the automatic definition comes out differently for different truncation of the database. For the database in the Calculator sheet I truncated the database by removing samples at step > 13, except I left in two samples at steps 14 and 15 that had been classified Ib and IB (discussed below). The definition for I type is also sensitive to exactly which markers are assumed for the first iteration as the type. The TypeRank sheet in IType.xls uses the 19 I type samples, excluding only the one that is A type. I tried quite a few other database truncations, and various assumed sets; those yielded different definitions with higher SBP. My published SBP formula is defined in a way that provides a larger number to compensate in part for such selection bias.

On the other hand, for the dozen or so samples that fit I type best, step < 7, the database and the number of markers do not matter; the same dozen or so samples are captured as I type for any reasonable definition using a wide breadth of markers. We can be confident that there is a valid clade corresponding to those dozen best I type samples that will some day be captured as a haplogroup by a new SNP. Beyond those best dozen samples, steps 7 to 8, there are another dozen or so samples that seem to be I type but at lower confidence; the background might be significantly more than the best fit SBP. In my publication I explain why background increases very rapidly with step. I suppose the actual percent of background might vary from maybe about 1% at step 2, to maybe about 40% at step 8.

What does this mean? The simplest explanation: There was a “father” haplogroup thousands of years ago. Due to population bottlenecks, only a small number of the males from that father haplogroup are MRCA’s (ancestors of clades that exist today). The descendants of the I type MRCA participated in a significant population expansion. I type is the only large clade from that haplospace neighborhood showing up today in the Polish Project. Other smaller “brother” clades show up, and because there are many more haplotypes at larger step values, those brothers are randomly distributed at large steps in my I type analysis. This is a simple explanation; more complex explanations are possible - for example involving migration of tribes from distant lands.

IB are Borderline, at step just beyond the cutoff for I type, not fitting any other known type, with only about 50% confidence that they will someday end up in a haplogroup corresponding to I type. Samples are also assigned to I Borderline when the nearest matches at 67 markers are I type. There are two samples at step 10 (new definition) now changed from I type (old definition) to IB using the new definition. There are 4 more prior IB samples at steps 12 to 15 now changed to K and KB. The next update of the Results Table will show slightly smaller totals in I and IB.

As 67 marker data accumulates in the near future, it is likely a slightly better definition may turn up with even lower SBP, and I type may separate into subtypes with <20% SBP. The 111 marker data is promising (discussed in a following paragraph).

A clade that is very well isolated (<5% SBP) has a high chance of soon being defined by a newly discovered SNP haplogroup. For I type with 12.3% SBP, a new SNP might be older, including some small older clades, or a new SNP might be younger, leaving out some marginal I type small clades. For example, I recently discovered a new SNP in my own Y-DNA that is slightly older than my predicted type - see L540.

My maternal Iwanowicz grandfather was I type. This explains my extra effort analyzing I type. The two Iwanowicz samples are my maternal first cousin and a man that I found in Poland who seems to be my 4th or 5th cousin. Technically, one of those should be removed for slightly higher SBP because I recruited that data, but the bias for 20 samples is small (SBP becomes 13.0%).

One of the Iwanowicz samples was removed for the Results Table, along with editing of family sets in other categories.

SBP for Ia and Ib are 11.9% and 17.0%. The definitions have breadth. These are good results, providing better than 80% confidence of validity for each. However, these all fall outside I type with my new definition. Even with my old definition, only 4 of these were I type at high step, the rest were IB. Using an I code was a bit arbitrary. Now is not a good time to change their code names, because quite a few new SNPs will soon be available. With more SNP data small types such as these can soon be renamed with more confidence.

Back in 2009, and still today, A type overlaps with I type at the margin. So does the newer D type. However, A type is coming out positive for the new haplogroup based on the L342 mutation, which seems to be rare in Poland. Mayka informs me that a WTY for one I type sample has come up L342-, as have two D type samples. In the past, I have always speculated that A type and I type are both subtypes of a larger K type. It now seems A type is really in a distantly related branch (L342) of the Y-DNA tree with similar STR values by coincidence. My prediction that I type is a subtype of K type is still a low confidence speculation.

The best ranked marker for I type is DYS578=9. DYS578 has the second slowest mutation rate of the 67 standard markers per the Chandler rates. The ancestral value is 8. The 9’s are colored orange in that analysis file IType.xls. From the 450 Polish Project samples at 67 markers, only 6 samples outside I type have the 9 value, one sample has a 7, the remainder are all 8, consistent with very few independent mutations. In the analysis file, notice that all the predicted I type samples have the 9 value with one exception, that A type (discussed above) at the last step of I has the ancestral 578=8. There are two A type with 578=9 at steps 11 and 12; the former has been tested L342+ (coded SNP results are in column BX of the file). All the other A’s have 578=8, so the obvious interpretation is an independent mutation to 9 within the A type clade. The only other 9 in that analysis file is an IB sample at step 12; that one might be another independent mutation; on the other hand, perhaps the mutation to 9 is much older than the TMRCA for I type, with that one sample representing a very small clade with an older node. The Ia and Ib samples all have the ancestral value 8; that’s evidence that Ia and Ib have old nodes with I - older than the 8 to 9 mutation.

The second best marker is DYS458=14, again orange in the file. This is a rapid mutator, so there is more variance. All but 2 of the I type samples with 578=9 have this 14 value. This is evidence of youth for I type. Those two, at 15 and 16, are probably independent mutations, although we cannot rule out the speculation that the 15 is the ancestral value telling us that the 458 mutation to 14 came after the 578 mutation.

Only 8 I type samples have 111 STR marker data and 2 of those are my Iwanowicz samples, so analysis at 111 is premature. That said, all but 1 of the 8 have DYS532=12; that one exception has 11. Value 11 also shows up for the one Ia sample, and for the two IB samples at 111 markers. DYS532 seems slow, but there are quite a few 11’s and 12’s in the 71 R1a samples at 111, so 532 will not displace 578 as the best marker for I type. Lapinski pointed out to me that a couple other markers also show promise at 111 markers for I type.

[Note inserted on 14 Sep 2011: There are now 9 I type samples and 7 of them have the signature (532,504) = (12,14). All other R1a samples have the modal (532,504) = (11,>14). This is evidence that the I type node in the R1a tree is not much older than the M458 mutation. DYS532 and DYS504 are two of the new 44 markers in the extension from 67 to 111 markers. I'll call this pair of values the signature for a hypothetical IPN clade. This is not strong evidence, because there is a small chance those 2 mutations happened twice independently - in the M458 clade and in the I type clade. The two exception samples were previously classified Ia and IB, so they might be from branches older than the signature mutations. I need to update my analysis to include these 2 markers, and update this I type topic. I’ll be busy with other things for a few months, so I added this note.]

I modified the Ysearch I type definition, EKVHX, for the new I62. I type has no samples at the step 9 cutoff in the Polish Project; on Ysearch there is only one Russian sample at step 9 (plus a couple modals), so I type is also well isolated on Ysearch, not just in Poland.

All 67 markers can be used for estimating the age of I type, because there are no significant recLOH problems with the compound markers in the I type data. Age comes out 1,208 years. See the ASD sheet in IType.xls. Raw ASD age is usually adjusted older due to population bottlenecks, as explained in my publication, but the adjustment should be small for I type because it is not very old and because I type obviously went through a population expansion. ASD age is highly uncertain due to caveats.

End of 5 Aug 2011 rewrite of I Type. Reminder: most of this web page has not been updated for quite a few months.

J. This type was suggested by Mayka. Only 6 members in the Polish Project, but this type is well isolated at SBP= 13%.

K. Update 20 Oct 2012. Since 2007, I had been using the name “K type” for a large R1a Polish category. Over the years I had subdivided K into several smaller types and clusters, although I did not have high confidence that all of them in fact belonged to a single unique clade, as discussed at this web page over the years. As types were subdivided, the remaining samples that fit the general K definition did not form a type. My K group is now known to be a mix of independent haplogroups, so the Polish Project stopped using K as a category in Oct 2012, although quite a few small clusters with names such as Kx and Kz are still predicted, because the confidence in the clusters has always been higher than the confidence in K. The various K categories are now clusters, types, and confirmed haplogroups within the two major haplogroup branches R1a1a1b1a2 (Z280) and R1a1a1b2 (Z93).

Most of the samples originally classified as K are now in B type, D type, E type (now part of Z92), H type (now equivalent to P278), I type, and J type, all significant branches of Z280. If an SNP shows up that captures many of these branches, I’ll be inclined to use the short code name K to discuss that branch.

The Kurgans are the ones who domesticated the horse more than 6,000 years ago. Many scientists think that one pre-Kurgan man is the male line ancestor of all R1a1a men who live today. The Kurgan hypothesis is controversial, and not necessary for this web page. You may have noticed that I used the letters of “Kurgan” for my original types and categories during 2008. I know of no compelling evidence associating the Kurgans with what I call K, the largest part of R1a1a1b1a2 (Z280), but it’s fun to speculate that K became widespread during a Kurgan population expansion.

I have been using the subscripts “z”, “y”, “x”, etc backwards through the alphabet because I am running out of letters for new clusters and types. These small hypothetical clades seem to be subclades of K, although I do not have high confidence about the subclade status.

Kt, Ku, Ky. Cluster with STRs similar to K type. These came up Z92+, so their match to K type is a coincidence. Need documentation as a new topic at this web page.

Ky type was suggested to me by Mayka on 21 Dec 2010. There were only 3 samples in Ky last year; now there are 5.

That KyType.xls file demonstrates that the same 5 samples are extracted using any number of markers from 11 to 67, although at some of those definitions one or two other samples are also extracted. The full 67 markers work best, SBP=23%.

Ky was more isolated last year; a few samples showed up in the gap, reducing SBP.

I’m using a hand edited definition, Ky63, using 63 markers, for the following reasons:

Ky is unusual in that 4 of the 5 samples have an unusual value for at least one marker. I highlighted these values in red in that file. Notice also the high step values for those four, 8 through 11, using all 67 markers (column BX), although SBP came out 23%, which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 5 samples seems to be a representative of a branch of this hypothetical clade, where each of the 5 branches has a node not much younger than the TMRCA.

Hand editing like this does introduce some selection bias, so the calculated SBP=13.6% for Ky63 is misleading. Countering the selection bias, some if not all of those 4 markers that I masked out might represent small tribal sized subclades, so future prediction of new Ky samples should work better using Ky63 with those 4 removed.

The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out. You can see that my selection is a bit arbitrary; I could have masked less than 4, or more than 4.

ASD age using all 67 markers comes out 917 years, cell N12. ASD age using the 63 markers not masked out comes out 878 years, cell N29, not much less. ASD age has a number of caveats, and 4 samples are not significant, so this age is highly uncertain. Ky seems young, as haplogroups go.

Kz type was suggested to me by Mayka on 6 Oct 2010. Mayka speculates this might be a clade of Kazakh origin. There were only 3 samples in Kz last year; now there are 6.

That KzType.xls file demonstrates that the same 6 samples are extracted using any number of markers from 2 to 67, so the definition is not critical for this well isolated type.

Kz is effectively more isolated than the SBP values (row 12 in that file) indicate, because the samples just beyond Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.

I’m using a hand edited definition, Kz59, using 59 markers, for the following reasons:

Kz is unusual in that 5 of the 6 samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those 6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%, which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a representative of a branch of this hypothetical clade, where each of the 6 branches has a node not much younger than the TMRCA.

Hand editing like this does introduce some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but moot). Countering the selection bias, many if not most of those 8 markers that I masked out might represent small tribal sized subclades, so future prediction of new Kz samples should work better using Kz59 with those 8 removed. Again, this is moot, because any number of markers extract the same samples.

The far right of the “ASD” sheet has the markers sorted by apparent age, with “M” indicating the markers that I masked out. You can see that my selection is a bit arbitrary; I could have masked less than 8, or more than 8.

ASD age using all 67 markers comes out 724 years, cell N12. ASD age using the 59 markers not masked out comes out 704 years, cell N29, not much less. ASD age has a number of caveats, and 6 samples are not significant, so this age is highly uncertain. Kz is clearly young, as haplogroups go.

Additional information supplied to me by Mayka: Three of the Kz type samples are from non-Polish men who suspect they have Polish male line ancestry, so it is not certain Kz type is Polish. Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP appears to be private, restricted to his family. Insofar as that man recruited 3 more Kz samples into the Polish Project, Kz seems proportionally twice as large. My next edit of the Results Table will reduce the percent size of Kz.

Kz has the prominent signature DYS459b=18. Mayka points out the additional signature DYS461=12, not one of the 67 marker set; most of the samples in Kz have been verified with this 12 value. Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other hypothetical clades, we do not know if the signature markers define a larger father clade.

L. This cluster is highly hypothetical. It is rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested this cluster to me. It is a well known Scandinavian cluster. I checked it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at 80% probability yet, so I am not yet using it for classification here. More documentation about L will be available here when I find time to study it.

L342.2. New topic 30 Oct 2011. This SNP was recognized as a new haplogroup by ISOGG during the summer of 2011. This was an L342 haplogroup category at the Polish Project for a short time in the summer and fall of 2011, but it has been replaced by Z93, because it seems all the L342.2+ samples are also Z93+ in the Polish Project. Apparently there are very few men elsewhere in the world found to be Z93+ L342.2-.

Z93 is a more reliable SNP than L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2, discovered earlier in the E haplogroup. L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests together are more reliable. These 4 mutations are in the same segment, which is apparently a segment that mutates relatively rapidly. Z93 is recommended as the better test for R1a samples that do not fit STR definitions of other R1a haplogroups; the Z93+ samples can do the L342.2 test. This information about L342.2 was supplied to me by Mayka.

The Z93 category has the samples that do not fit the two known subdivisions: A type and L342T cluster (next topic).

L342T. New topic 30 Oct 2011. Based on 26 Oct 2011 Polish Project data. Analysis file: L342TCluster.xls. I just noticed this cluster.

L342T is not a type, because SBP did not come out low enough. However, I included this cluster discussion here for the following reasons:

Seven samples at 67 markers fit my new 48 marker definition for L342T. There are 19 A type samples, which should all be in the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T; the closest A’s are at step 8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5 also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit the L342T definition) and the four samples beyond the cutoff, are assigned to K type and to subtypes of K; Z280 has recently become available for K type; as those background samples get tested in the future for Z280, my L342T cluster will start looking better. Let me say that another way: a cluster should be analyzed with data from its own haplogroup, so L342T should be compared only to L342.2 (Z93) data. But there is very little L342.2 (Z93) data available, so I used the full R1a database in that xls file. That means L342T is likely more isolated than it seems right now, so it is more likely to correspond to a valid haplogroup.

Mayka pointed out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code name. Of course, Tatars may belong to only a branch of L342T; I have no idea what fraction of L342T in Poland are Tatar. And of course Tatars are expected to be a mix of multiple haplogroups.

Three of the L342T samples, with the name Muchla, are apparently a family set, so they count statistically as only one sample, reducing the current count from 6 to 4, so SBP as calculated in that xls file should be increased (not as good). This is evidence against L342T being valid.
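
The family set adjustment is simple counting, sketched here in Python. The kit labels are hypothetical; only the Muchla family set and the counts 6 and 4 come from the paragraph above.

```python
def effective_count(samples):
    """Count samples, treating each family set as a single sample.

    samples: list of (kit_id, family) pairs; family is None when the
    sample is not part of a known family set.
    """
    seen_families = set()
    count = 0
    for kit_id, family in samples:
        if family is None:
            count += 1
        elif family not in seen_families:
            seen_families.add(family)
            count += 1
    return count

# Hypothetical kit labels for the 6 L342T samples.
l342t = [
    ("kit1", "Muchla"), ("kit2", "Muchla"), ("kit3", "Muchla"),
    ("kit4", None), ("kit5", None), ("kit6", None),
]
print(effective_count(l342t))  # 4
```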

M. Needs documentation. M type was brought to my attention by Larry Mayka, who informs me others have called this haplotype the Viking haplotype because of its concentration in northwest Europe.

N. Comment 29 Feb 2012: See the M458 topic for discussion of a new SNP, L1029, that seems to be equivalent to N type.

Complete rewrite of this topic 25 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NType.xls

N type is concentrated in Slavic countries. N type is discussed in my publication, page 179.

According to Ysearch and Yhrd N type seems to be spread all around the Slavic lands and central Europe, common from East Germany to Russia. Within Poland N type seems to be about the same size as P type, both about 9% of men. Worldwide, N is much larger than P. N type should be properly studied in a database that is not restricted to Poland. However, there seem to be subtypes of N that are concentrated in Poland. See the discussions on N subtypes below. I’ll continue to watch the Polish Project, because it will be interesting if more data provide more Polish subtypes within N.

During review of my publication in 2009, the SNP called M458 was published. I added notes about this to my publication on page 184. The corresponding haplogroup is now called R1a1a1g. This haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-). M458+ samples may turn up someday that do not fit either N type or P type, but I have not noticed any yet.

N type age (age means TMRCA) is about 2,000 years. That’s highly uncertain, but I’m 80% confident that age of 2,000 years is not off by more than a factor of 2 - age 1,000 to 4,000 years. The M458 mutation is likely much older than the age of N type.

I’m suspicious that N type includes many younger clades that just happen to have similar STR values, difficult to resolve into clusters or types. I offer some speculation along these lines in the hypothetical subclade topics below.

I highly recommend that someone from N type purchase WTY, a commercial product for discovering SNPs. No sample from N type has been submitted for WTY. That means there is a good chance that the first N man to submit his sample to WTY will discover one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an SNP that captures about half of N type - or perhaps an SNP that captures a small subclade - or perhaps multiple such SNPs. My WTY was the first in a long time in my haplogroup, so I found 14 new SNPs.

It’s interesting to wonder why R1a1a1g seems to be composed of two types that differ substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type topic. Much of my P type discussion is also related to N type, so I avoided repeating all the details here; please read my P type discussion if you are interested in more about N type.

N seems to be older than P. I wonder if there are subtypes of N about the same age as P. I avoid too much speculation in this web page - just enough to indicate my motivation. I’m wondering if there are clades in various haplogroups, mostly P and N, associated with the origin of the Polish nation - a few centuries more than a millennium ago.

I have only identified 4 small subclades of N so far: I am quite confident of Ng type, but less confident of N-Ashk type. The Nt and Ns clusters are hypothetical; I have about 70% confidence in them. These 4 are used for assignments at the Polish Project web page. I also identified a few clusters with roughly 50% confidence; these are too speculative for formal assignments. All are discussed below. I made speculative assignments based on all these types and clusters within N type, in column CD of that file NType.xls, Calculator sheet. My file NClusterAssignments.xls has lots of details. If you are N type, you can find your row with your kit number, and see your speculative assignment. For the “clusters”, I estimate a 50-50 chance an assignment will need to be changed in the next year or so, as more data becomes available.

In addition, N type has many bimodal markers, which hint at yet more subclades not discussed here. This is evidence that N type experienced population expansion when it was young (not long after the TMRCA). More discussion below.

The paragraphs up to here are a brief summary. The rest of this topic is a detailed discussion about N type and hypothetical subclades:

This Sep 2011 analysis includes only data from the Polish Project. I’ll wait a few months before reviewing data outside the Polish Project. My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following topics, which have not been updated for several months:

For the size of N type, please see the table at the top of this page, where N has only 4 more samples than P (87 vs 83 - 5 Aug 2011 data). In my 2009 publication N had one less than P (28 vs 29, Table 6 page 169). The 70% confidence interval for 87 samples is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by implication in Poland) within statistical sampling accuracy, at about 9%.
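The interval quoted above can be checked with a short calculation. This is only a sketch under assumptions: it treats the 87 samples as a Poisson count and uses the standard exact (Garwood) interval, with the chi-square quantiles approximated by the Wilson-Hilferty formula so that only the Python standard library is needed. The function names are hypothetical, and this is not necessarily the calculation used for the numbers above, although it comes out close to the quoted range.

```python
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def poisson_interval(count, confidence=0.70):
    """Two-sided interval for the expected count behind an observed count."""
    tail = (1.0 - confidence) / 2.0
    lower = 0.5 * chi2_quantile(tail, 2 * count)
    upper = 0.5 * chi2_quantile(1.0 - tail, 2 * count + 2)
    return lower, upper


low, high = poisson_interval(87)
print(f"87 samples, 70% interval: about {low:.1f} to {high:.1f}")
```

Because the P count of 83 falls well inside this interval, the N vs P difference is within sampling noise, which is the point made above.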

My 2009 published definition for N type, N45, still works very well. I did not change that definition at my Jan 2011 update and analysis here in this topic. This Sep 2011 N46 update is just a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP data that has accumulated over the past year. Both definitions are compared in that analysis file NType.xls, Calculator sheet, columns BZ to CC.

Tweaking the definition like this, to better fit SNP data, introduces some selection bias. I discuss this issue in the P type topic, where I did a similar tweak; please read that topic if you are interested in the statistical justification. The justification is not as good for N type, so I’ll return to this issue in the “old branches” paragraph below.

This new N46 definition fails to capture only one M458+ sample, which falls at the cutoff step 8. This new N46 definition captures only one foreigner, L540+, at step 7, the last step of the type. The other samples at step 8 have tested either M458- or L260+, except one that fits D type well, so they are all confirmed as not N type. Similarly, 7 of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to other types, with only 2 that are Borderline fits to other types. In other words, the N46 definition captures the M458+ L260- samples with apparent 98% accuracy. However, my confidence is about 80% for step 7, about 90% for step 6, and 95% or better for step <6. Again, please see the P type discussion about confidence for a general explanation. P and N are similar in this regard. I have related discussion about N type confidence in the “old branches” paragraph below.
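For readers unfamiliar with "step" and "cutoff", here is a minimal sketch of how a definition of this kind can be applied. It assumes step is the sum of absolute repeat differences from the modal haplotype over the markers in the definition, and that a sample is captured when its step is below the cutoff (so a cutoff of 8 keeps steps 0 to 7). The five modal values are taken from this page; the two samples, the cutoff of 3, and the function names are made up for illustration. The Excel files may treat compound markers differently.

```python
def step_count(sample, modal, markers):
    """Sum of absolute STR differences from the modal over the chosen markers."""
    return sum(abs(sample[m] - modal[m]) for m in markers)


def is_captured(sample, modal, markers, cutoff):
    """A sample belongs to the type when its step is below the cutoff."""
    return step_count(sample, modal, markers) < cutoff


# N modal values mentioned on this page, used here as a toy 5-marker definition.
modal = {"DYS442": 14, "DYS446": 13, "DYS390": 25, "DYS454": 11, "DYS492": 12}
markers = list(modal)

# Hypothetical samples, not real Polish Project data.
sample_a = {"DYS442": 13, "DYS446": 13, "DYS390": 25, "DYS454": 11, "DYS492": 12}
sample_b = {"DYS442": 14, "DYS446": 12, "DYS390": 24, "DYS454": 12, "DYS492": 14}

for name, sample in (("a", sample_a), ("b", sample_b)):
    step = step_count(sample, modal, markers)
    print(name, step, is_captured(sample, modal, markers, cutoff=3))
```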

Almost all the samples near the cutoff for the previous N45 definition have been SNP tested. This high testing rate is not a coincidence; Mayka and I have been encouraging men with marginal samples to do the M458 and L260 tests. (We paid if cost was a problem.)

The NType.xls analysis file has 10 columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as tentative N type definitions, with automatic selection of the best markers. For each column, I colored the step count violet for samples captured by that definition. You can see at a glance that any definition using 2 to 67 markers captures more than 80% of the N type (M458+ L260-) samples, and not many foreigners, so just about any definition works surprisingly well. In other words, N type is very well isolated in haplospace.

For the two best automatic definitions, I used boldface to highlight the N type samples missed by that definition, and also boldface to highlight the foreign samples captured by that definition. I used boldface similarly for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the effect of 3 different cutoff choices.

You might try resorting the sheet by column (select everything from cell A14 to the end) to better compare the results.

The issue of SBP is moot for N type now that the SNPs M458 and L260 are available, but an analysis is instructive: That NType.xls file has automatic marker selection of N type, and automatic calculation of SBP, disregarding the SNP data. The best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%. However, N46 is a better definition because N61 captures only 80 of the 87 N type plus that same one foreigner. But still, 8 errors out of 87 (7 misses plus the one foreigner) is not bad for N61, better than the 13.2% SBP (SBP is a high estimate for statistical confidence).

I considered calling N46 a definition for M458+ L260-, with a different definition for N type as a slightly smaller subtype, leaving out some samples that do not fit the N type definition with lowest SBP. I could not come up with a convincing definition for such a smaller subtype. So at least for now, I am considering N type as the same as M458+ L260-, with the understanding that may change in the future.

The summary conclusion for all those columns of trial definitions: My preferred N46 definition (column CC) does the best job of capturing N type (M458+ L260-). Most of the other columns are trying to define N type as slightly smaller, leaving out a few of the samples (not always the same samples). Most definitions for N type have many samples at or near the cutoff. My explanation is in the next paragraph:

Old branches: A type is a hypothetical unique clade. Of course, every clade is composed of subclades - branches in the Y-DNA tree. Here is a simple explanation for the previous few paragraphs of discussion: N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type MRCA at a few STR values from the standard 67 set. Those old branches have many younger branches (twigs) that differ at yet more STRs. In other words: the N tree might have a few small branches near the ground. Those small old clades provide samples in the database with large step, but each sample is from a different twig, so these do not correlate into obvious clusters. Any clade has statistical outliers with large step; a few small old branches would provide more outliers for N.

Those old branches may not be small world wide. One possibility - a large subclade of N concentrated outside Poland might have one small branch in Poland, corresponding to a man or tribe that moved to Poland long ago. I am watching for evidence along these lines, but so far this paragraph is speculative.

In addition, there might be additional large old subclades that seem young. I consider this possibility in the discussions below. The age of a clade can be much younger than the node. I discuss this in another topic, where I call such clades smooth branches. The N tree might have a number of small smooth trunks with nodes near the ground - that would not necessarily be evident as STR correlations. On the other hand, the N tree might have only one main trunk, almost smooth, with only a few small branches near the ground. The actual situation might be more complicated, with multiple trunks of various sizes, at various distances from the ground. I can’t tell yet from the STR data. Perhaps another year of additional STR data may help.

Why am I speculating about N type smooth branches? I see plenty of hints for more branches in the N type data, but little statistical confirmation. In the discussion below for subclades, I offer evidence (not definitive proof) for many more significant sub clades within N type.

This discussion is personal. It is my opinion, based on my statistical analysis. Someone might send me an email any day now pointing out a convincing cluster or type in N that I missed. Someone else might disagree with my analysis about particular hypothetical N subclades.

Reminder: This discussion is limited to Poland, as represented by the Polish Project. Outside Poland there is additional probability of M458 branches showing up someday that fit neither N type nor P type. Outside Poland I expect yet more N type branches.

Regarding concentration in Poland, I use percent of samples in Ysearch with “Origin” Poland as an objective measure. This is discussed in my publication, where Table 1 shows P12 (the P type modal haplotype using only the original standard 12 markers from the Polish Project) with 42%, while N12 has only 14%. Those numbers 42% vs 14% are not calibrated (because of the unknown concentration of men with Poland origin in Ysearch) but those numbers are a relative indication of a type that is concentrated in Poland vs one that is not particularly concentrated in Poland. My file NYsearch.xls has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift due to more data. That same file has the N46 definition at 24%. This is evidence that N type, defined using 46 of 67 markers, is only slightly more concentrated in Poland than the 12 marker equivalent. The simplest explanation: There are probably large M458 clades outside Poland that match N12 and also match N46 at less than the cutoff, but the Polish samples are only twigs on those branches, descended from one man or family or tribe that moved to Poland a millennium or so ago. It makes sense that clades within M458 might be regionally concentrated. That 24% concentration for N46 is of course an average; there are subclades of N with higher and lower concentration. I found a few, discussed below; that file NYsearch.xls has a sheet for each subclade analysis.
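The concentration measure is simple arithmetic, sketched here with a made-up list of origins. One assumption to note: this sketch counts every matching sample in the denominator, including those with unknown origin; the NYsearch.xls sheets may count differently. The function name and the data are hypothetical.

```python
def poland_concentration(origins):
    """Percent of matching samples whose stated origin is Poland."""
    polish = sum(1 for origin in origins if origin == "Poland")
    return 100.0 * polish / len(origins)


# Made-up origins for samples that match some definition below its cutoff.
matches = ["Poland"] * 3 + ["Germany"] * 2 + ["Russia"] * 2 + ["Unknown"] * 5

print(f"{poland_concentration(matches):.0f}% Poland")
```

With very few matches the percentage jumps in large steps, which is why a figure based on 2 or 3 samples is only a weak hint.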

Age: N type comes out 2,340 years old using all 67 markers. See cell N12 in the ASD sheet in NType.xls.

Because of recLOH issues, the compound markers 464, YCA, and CDY present difficulties estimating age in the N type data. Other compound markers are OK. The ASD sheet allows a mask, row 21, where I masked out the 8 markers for these recLOH difficulties. The result, using 59 markers, cell N29, is 2,010 years. That’s my best guess for the age.

On the far right of the ASD sheet I sorted the markers by apparent age. YCAb comes out 20,704 years, demonstrating the recLOH problem.

The second (apparently) oldest marker is DYS454, at 18,744 years. This old age is due to only 5 mutations in this slow mutator. DYS454 is clearly bimodal. In my notes, I use the Nj code for the 2nd mode with these 5 samples, defined by 454>11. This is evidence of a subclade, but the statistics are not convincing yet. Maybe with more data in the near future I might call some of these samples the Nj cluster. It’s not fair to exclude this “old” marker, DYS454, because there are 7 markers with zero age (no mutations in the 87 samples) and there are 7 more markers with less than 1,000 years apparent age. The reason for averaging markers is that apparently old markers should be averaged out with apparently young markers. Anyway, you can go ahead and mask out DYS454 by deleting the mask number at cell AE21, and the new age (58 markers) without 454 is 1,990 years, only a 20 year decrease. I offer this paragraph of discussion as one example of preliminary evidence of an N type subclade, based on 454>11.
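The effect of a mask on an ASD age can be sketched as follows. This is an illustration under assumptions, not the formula in the ASD sheet: it pools the markers by dividing the summed ASD (average squared distance from the modal) by the summed mutation rate, then multiplies by an assumed 25 years per generation. The marker names, rates, and samples are made up; the mask mirrors the idea of a mask row, with 1 meaning "use" and 0 meaning "masked out".

```python
def marker_asd(samples, modal, marker):
    """Average squared distance of one marker from its modal value."""
    return sum((s[marker] - modal[marker]) ** 2 for s in samples) / len(samples)


def apparent_age(samples, modal, rates, mask, years_per_generation=25):
    """Pooled ASD age in years over the markers that are not masked out."""
    used = [m for m in modal if mask.get(m, 1)]
    total_asd = sum(marker_asd(samples, modal, m) for m in used)
    total_rate = sum(rates[m] for m in used)
    return years_per_generation * total_asd / total_rate


# Hypothetical markers, rates (per generation), and samples.
modal = {"fast": 30, "medium": 14, "slow": 11}
rates = {"fast": 0.01, "medium": 0.002, "slow": 0.0002}
samples = [
    {"fast": 30, "medium": 14, "slow": 11},
    {"fast": 31, "medium": 14, "slow": 11},
    {"fast": 29, "medium": 15, "slow": 11},
    {"fast": 30, "medium": 14, "slow": 12},
]

age_all = apparent_age(samples, modal, rates, mask={})
age_masked = apparent_age(samples, modal, rates, mask={"slow": 0})
print(f"all markers: {age_all:.0f} years, slow marker masked: {age_masked:.0f} years")
```

In this toy set a single mutation at the slow marker has a large effect because there are only 4 samples and 3 markers; with dozens of markers pooled this way, one slow marker carries little weight.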

The third oldest marker is DYS531, at 14,319 years; at this bimodal marker I use the code Np for the 2nd mode value. Again, I’m waiting for more statistical evidence for a subclade.

That far right side of the ASD sheet has more notes about markers with old apparent age.

Age estimation from STR variance is highly uncertain. At another of my web pages, I use M458 as an example of age caveats. I have more discussion about age estimation methods in the P type topic; please read those two topics if you would like more discussion; N is similar to P in this regard.

I’m not too concerned about getting the age of N type correct in Polish data because I suspect in less than a year there will be enough evidence to subdivide N - new SNPs and / or more STR data for better statistical significance. I suspect there will be younger subclades. Furthermore, M458+ L260- is not really a tree; it seems to be a branch of the Y-DNA tree that is well isolated - a long smooth segment near the node; but I mentioned above my suspicion that the main branch might not be really smooth - there might be significant old branches concentrated outside Poland; if this is true I’ll need to soon redefine N type as younger, excluding any such significant branches. I’ll leave it for someone else to estimate the age of M458+ L260- from worldwide data; I’ll concentrate on N type, and hypothetical sub clades in Poland.

There are 12 samples from N type available with the new 111 STR marker set (18 Jul 2010). Only DYS532=12 is an obvious signature marker for N type from the 44 new markers; 10 of the 12 have this value. Modal for R1a is 532=11. P type also has the 532=12 value, also 10 of 12 samples, so this marker also provides a signature for M458 with good statistical significance. I type also has the 532=12 value; see the I type discussion below.

The following topics are my proposed subclades for N type in the Polish Project. Please consider reading the section P Type Bimodal Markers, if you would like more discussion of how I use bimodal markers as hints for subclades; that same discussion applies here for N type. If you are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls. Near the bottom of the “Haplotypes” sheet is a list of 70 code names for signatures that I considered for N type subdivision. I discuss only a few of these here. I spent a lot of time studying tentative subclades of N because I’m anxious to find significant subtypes that are concentrated in Poland. I uploaded a total of 17 Excel analysis files associated with N and tentative subclades, all discussed above and below.

Ng. Rewrite finished 22 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NgType.xls. Ng is a small subtype of N type, but it has highest confidence.

This is a very small subtype, only 3 samples, but it is very well isolated. The definition uses 56 markers, cutoff 4, gap 9. There are no samples in the gap, from step 4 to 12. SBP = 15.8%.

These same 3 samples are present in Ysearch, where the gap with no samples is from 4 to 11. Two samples at step 12 are from Germany and Unknown. There are none at step 13 and 11 samples at step 14. It seems Ng is concentrated in Poland.

The signature is (537, 492) = (10, 14). These are the only 3 Polish Project samples in N type that have any mutation from the 12 value at 492, and they have a 2-step mutation. 492 is ranked 18th of 67 in the extended Chandler mutation rates. The 10 value at 537 is also rare - only these 3 plus 2 other samples have it in N type in the Polish Project. The same 3 Ng samples are extracted from N type using 1 to 67 markers. They are well isolated using as few as 7 markers because they have little variation from each other in the rapidly mutating markers, so those rank well for the Ng definition. ASD age comes out 619 years using all 67 markers but of course that is a very rough estimate.

The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and passed on those 2 unusual mutations.

The 3 Ng samples fall at steps 4, 5, 6 with the N45 definition of N type, a hint that the Ng node is near the center of the N type branch, not one of those old branches I speculated about, but this is just a preliminary hint.

I introduced Ng type in Oct 2010; there have been no new 67 marker data in the STR neighborhood of Ng type, so SBP has been 15.8% since, with the same definition.

The “g” is only my arbitrary code name that I have been using for the DYS492=14 signature.

N-Ashk. Rewrite finished 25 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NashkType.xls. N-Ashk is a small subtype of N type. Only 4 samples.

These seem to be Ashkenazi samples. Mayka pointed out to me that the names seem Ashkenazi, per his experience. The samples beyond the cutoff are apparently not Ashkenazi.

I introduced this type in Jan 2011, with SBP 23%, slightly more than my stated 20% limit for using the word type. I called it a type anyway for two reasons: First, the Ashkenazi names are independent evidence of a clade. Second, the N-Ashk modal haplotype differs from the N modal at 6 markers, which is evidence of a fairly old node in the N branch of the Y-DNA tree.

I introduced this type as Nca type, because of what I have been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was confusing because the samples do not match what I have been calling the Na marker. Nc is large; I doubt N-Ashk is a twig in a large Nc branch; the Nc mutation more likely arose independently in the N-Ashk hypothetical clade.

This Sep 2011 reanalysis makes a cleaner cluster of data, although still small with only 4 samples. The 594=11 marker is very clean; these 4 samples are the only R1a samples in the Polish Project with this value. SBP increased to 47%, so it is a stretch to call this a type, but the Ashkenazi connection is improved now and the 594=11 marker seems to be strong evidence. Also, I avoid making changes in classification names without significantly more data, so I’ll continue to call this a “type” for now. There are no longer any N-Ashk Borderline samples at 67 markers; the Borderline category is used for apparent Ashkenazi samples that match well with only 37 markers.

The improved definition uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4. (The previous definition used 59, cutoff 5.) The improvement: I masked out CDY. The previous definition used CDYb, missing an Ashkenazi sample that fits the type well, but has recLOH, providing a misleading step of 5 at this one marker. With that new sample the ranking of markers came out slightly differently, so a few other markers were added or removed from the definition. The old and new definitions are available in NashkType.xls. The new definition is also available at Ysearch as 2TZKF, and in my Haplotypes.xls file.

The ASD age comes out only 668 years, cell N29 in the ASD sheet in NashkType.xls. Age calculated from only 4 samples is highly speculative, but N-Ashk seems young because of little variation in marker values. The ASD should use (4-1) in the denominator instead of the total 4 samples (although most genetic genealogists do not do this for small sample sizes); with that adjustment the age comes out 890 years, but that is still highly speculative. That cell N29 is using 61 markers; CDY and 464 are masked out. (The mask is row 21, which you can easily edit.) All 67 markers yield 1,024 years, cell N12, because of CDY. DYS464 has no mutations in the set of 4, so including those 4 reduces the age, but I left 464 out because most people routinely exclude the 464 set from ASD.
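The small-sample adjustment is a one-line rescaling: if the squared distances are summed the same way and only the denominator changes from n to n-1, the age is multiplied by n/(n-1). A minimal sketch, with a hypothetical function name:

```python
def small_sample_adjust(age_years, n_samples):
    """Rescale an ASD age from an n denominator to an (n - 1) denominator."""
    return age_years * n_samples / (n_samples - 1)


adjusted = small_sample_adjust(668, 4)
print(f"668 years with 4 samples becomes about {adjusted:.0f} years")
```

The factor is 4/3 for 4 samples but only 87/86 for the 87 N type samples, which is why the adjustment matters mainly for very small clusters.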

N-Ashk is quite young, but the node seems old because of the 6 marker distinction from N type. The simplest explanation: N-Ashk has a long smooth branch, having an old node with N, but no further branching near that main node. The samples in the Polish Project all seem to come from twigs with young nodes. I speculate that there may actually be some branches of N-Ashk outside Poland. Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat less than a millennium ago. More data will eventually confirm or refute this speculation.

2TZKF is the modal haplotype at Ysearch, where only two of these samples are present, and where there are 2 additional samples in the gap, from Russia and Belarus; the simple explanation is that N-Ashk is concentrated in Poland, although there is too little data for confidence. See NYsearch.xls for my Ysearch analysis.

Nt. Edited 25 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NtCluster.xls.

With 17 samples, Nt cluster is my largest speculative subclade of N type identified so far.

SBP = 27%; this cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious of this Nt cluster due to selection bias: I considered 70 signatures for N type during the summer of 2011, and carefully analyzed more than 30 of them. With that many attempts, a false positive is likely. One of the clusters I analyze will necessarily have the lowest SBP, but that might be just the luck of the data. No one knows how to calculate the statistical confidence in such a case. I discovered Nt at the end of this major effort. If SBP improves with more data for Nt I’ll upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll lose interest in Nt.

If Nt is valid, it is probably concentrated in Poland. See NYsearch.xls. See my Ysearch method discussed above. I consider this additional evidence that Nt corresponds to a clade, boosting my estimated confidence to about 70%. We don’t always use 70% confidence for assignments, but everyone is anxious for more subdivision of N type in the Polish Project, so we started using Nt in Sep 2011.

The Nt definition uses 48 markers, cutoff 4, one sample in the gap at step 4. The definition is available at Ysearch as 2544E.

Nt is based on the signature DYS442<14. However, there are 29 samples with that signature, and 5 of the 17 Nt cluster samples have the N modal 14 value at this marker. My simple speculative explanation: the 442 mutation from 14 to 13 occurred independently in the Nt clade after the node with the main N type branch. Other speculative explanations are possible - those 14’s might be a back mutation within a much larger “father” clade that carries the Nt signature on most but not all samples.

One Nt cluster sample has the 12 value at 442, which could be another mutation or an independent double mutation.

If we subtract the 12 Nt signature samples with <14, that leaves 17 more samples (not included in my Nt cluster) with this second modal value at 442. There are only 3 samples at 15 in all of N, and we expect step up to be more common than step down for a slow mutator (see my publication for references), so that still leaves an excess of samples with <14, implying yet another hypothetical clade with an independent mutation, or a larger “father clade”, but this paragraph is getting highly speculative. I have more speculation like this about independent clades vs large clades in the Na, Nb, and Nc topics below; similar speculation applies to Nt.

Thirteen Nt samples match what I call the Na signature, discussed below, but two samples match the alternate mode Nb; the last two samples are one step away from Na. This is evidence of an even larger Na father clade, but as discussed below the Na vs Nb signatures may have arisen multiple times independently, so I’m not confident to speculate further along these lines.

Ns. Edited 23 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: NsCluster.xls. Ns cluster is a speculative subclade of Nt cluster.

With 6 samples and SBP = 27%, this cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious of this Ns cluster for the same reasons given above for Nt. On the other hand, Ns looks like a credible subclade of Nt, which adds credibility to both of them.

If Ns is valid, it is probably concentrated in Poland. See NYsearch.xls. The 67% concentration is the best I have seen so far, but this % is highly uncertain because it is based on only 2 Ns samples at Ysearch. Such as it is, I consider this additional evidence that Ns corresponds to a clade, same as my confidence for Nt.

The definition uses 47 markers, cutoff 2, no samples in the gap at steps 2 and 3. The definition is available at Ysearch as A5NSG.

Ns is based on two signatures. Ns is my code for DYS446=12, 9 samples, vs 446=13 modal for N type. Nt is my code for DYS442=13, 5 samples, vs 442=14 modal for N type. The 6 Ns samples are all at steps 0 and 1 with the 47 marker definition; the other 3 with that signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation happened twice independently in the N type clade. Five of the 6 Ns samples have the Nt signature, but that 6th one has the value 12, two steps from the N modal 14, so it should be considered Nt also.

Nd. Edited 24 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: Nd53Cluster.xls.

Based on the signature DYS389I = 14, vs N modal 389 = (13,29). Nine samples have the Nd signature. Only 3 of these 9 fit Nd53. My confidence is only about 50% that these 3 samples really belong to the same clade; I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic.

DYS389II has the value 30 for Nd but this is not a mutation at 389II. See compound markers for an explanation.

I call this Nd53 because the 53 marker definition is somewhat arbitrary - there is no very likely definition. It is likely I’ll need to change the definition soon, when more STR data becomes available. Also, “Nd53” makes it clear that this is not the same as the cluster formed using only the Nd signature.

Nd53 is not used for assignments in the Polish Project; see NclusterAssignments.xls for speculative assignments.

The 3 samples do not have Poland as origin, although I suppose those men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples. On the other hand, Nd53 might be representative of a clade that is concentrated outside Poland.

Ne. Edited 24 Sep 2011. New topic 23 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: Ne40Cluster.xls.

Based on the signature DYS390 = 24, vs N modal 25. Twelve samples have the Ne signature. Only 3 of these 12 fit the Ne40 cluster. My confidence is only 50% that these 3 samples really belong to the same clade; I included this analysis as an example of an uncertain clade, and for discussion below in the Na topic. Nd and Ne have similar status.

I call this Ne40 because it is likely I’ll need to change the 40 marker definition soon, when more STR data becomes available.

Only one of the 3 samples has Poland as origin, although I suppose the other two Ne men have suspicion of Polish ancestry, because that is usually the case for Polish Project samples. On the other hand, Ne40 might be representative of a clade that is concentrated outside Poland.

Na and Nb. I have been rewriting this topic throughout the late summer of 2011. Finished 24 Sep 2011. Based on 5 Aug 2011 Polish Project data.

I introduced Na and Nb in my publication, page 179 and Table 3. I have been updating the discussion for Na and Nb here at this web page. I consistently emphasize that these are speculative subclades. In retrospect, I should have avoided the word “type” for these because more data over the years has convinced me that the explanation for what is going on is not two subtypes of N. It will take me a few paragraphs to explain the issue of Na and Nb:

One way to split the N type data, obvious at a glance, is by the number of markers for 464. Some samples have 4 values, some have 6, just a few have 5 or 7.

I understand that the 464 set is the most prone to genetic testing evaluation errors, so this or any categorization using 464 will have uncertainties. If 464 is taken in combination with other markers that means some statistical uncertainty due to possible evaluation errors at 464. Specifically, a sample in a database with 4 values at 464 might really have 5 or more values, and vice versa.

Follow my links if you wish to read more about compound markers and recLOH issues, which introduce confusion for the 464 marker set. Briefly, copy mutations can increase the number of 464 markers, but recLOH mutations might reduce the number. A single copy mutation can change more than one value in the set. Copy mutations and recLOH mutations are rare, about the same frequency as very slowly mutating STR markers. Net mutations in the 464 set are common, with frequency among the fastest in the standard 67 set. For the Chandler rates, each of the four markers 464a to 464d is assigned a rate 1/4th the net rate for single mutations for the set of 4.

I use Na as my code for the signature 464 = (12,12,15,15,15,16) - the most common value set for 464. 28 of the 87 samples. My Nb signature is the next most common, 464 = (12,15,15,16). 16 samples. I say 464 is multimodal because there are also two sets with 4 samples each; that’s why I’m using Na as a signature even though it is the modal value for N type as a whole. This is for the 87 N type samples in my 5 Aug 2011 download of the Polish Project; the proportions change every few months as data accumulates due to the statistics of small sample sizes.
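Counting value sets like this is easy to reproduce. The sketch below tallies the whole 464 value set of each sample as one unit, so sets with different numbers of copies are kept apart. The two named signatures are the Na and Nb value sets given above; the list of samples, the third value set, and the function name are made up for illustration.

```python
from collections import Counter

NA_SIGNATURE = (12, 12, 15, 15, 15, 16)
NB_SIGNATURE = (12, 15, 15, 16)


def value_set_counts(value_sets):
    """Count how often each complete 464 value set occurs, most common first."""
    return Counter(tuple(v) for v in value_sets).most_common()


# Made-up samples: each entry is one man's full set of 464 values.
toy_464 = [
    NA_SIGNATURE,
    NB_SIGNATURE,
    NA_SIGNATURE,
    (12, 15, 15, 15),
    NA_SIGNATURE,
    NB_SIGNATURE,
]

for value_set, count in value_set_counts(toy_464):
    print(count, value_set)
```

A marker set is multimodal in this sense when several value sets each account for a noticeable share of the samples, rather than one set dominating.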

Na and Nb differ by 2 steps following the Ysearch method, but that is misleading because Na can turn into Nb in a single recLOH mutation, which might have happened more than once in the past in this N type database. Nb can turn into Na with a single copy mutation. I may not be exactly correct in this paragraph if my assumption of the structure of 464 in N type is incorrect, but this paragraph is certainly a brief example of the kind of confusion that arises with 464.

It is easy to construct clusters using 464 in N type. Too easy. Too many choices for clusters, as I discuss in the following. I could not come up with clusters with good statistical confidence. My Excel analysis files allow setting maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any variation of a sample from a trial definition; still I found no clusters with confidence.

My analysis files allow an alternate method, treating the 464 markers as individual markers. This is the method I used in my 2009 publication, still no clusters with confidence.

My default is to follow the Ysearch method for counting step at 464, although this method is obviously less than perfect.

When trying individual markers, DYS464b is best. In my notes I use Na1 - 464b<14, and Nb1 - 464b>13; these two signatures neatly split all the N type data. Na1 captures all the Na samples plus mostly samples with more than 4 markers; Nb1 captures all the Nb plus mostly samples with 4 markers; there are exceptions. Using Na1 vs Nb1 I come to the same conclusions as using Na and Nb, discussed below.

DYS464e provides another way to split the data. In my notes I use Nx - any value for 464e, and Ny - no value for 464e. Nx captures all the samples with more than 4 markers including the Na samples; Ny captures all the samples with 4 markers including the Nb samples. Using Nx vs Ny I come to the same conclusions as using Na and Nb, discussed below.

Consider my definitions Na45 and Nb32, with 45 and 32 markers. See those two Excel files for details. My choices for 45 and 32 are arbitrary. Those files show columns with trial definitions using a wide range of markers, automatically chosen by rank. A wide range of marker counts seems roughly equivalent. It is remarkable how many samples fit very well using up to 50 markers for trial definitions: Na has 16 samples at step zero using 11 markers, and 15 samples at step less than 2 using 45 markers; Nb has the same 14 samples at step zero using from 11 to 32 markers. When the 464 set is excluded from the definition, some Na samples fit the Nb definition, and some Nb samples fit the Na definition. One simple explanation: Na45 and Nb32 might correspond to two very young clades. However, there is an alternate explanation: Na45 might correspond to two or more young clades, and Nb32 might correspond to two or more young clades, and they may be a “bushy” set of branches where some Na45 clades are connected by nodes to some Nb32 clades. I see no way to be confident that most of the Na samples are in a branch distinct from a branch with the Nb samples. I suppose if your sample matches Na45 at step zero or one, there might be better than a 50-50 chance that you and others who match at <2 belong to a unique clade that may someday have an SNP definition, but such a clade will surely exclude some of the step <2 samples, and include some samples from steps 2 and 3, so Na45 does not provide a definition. The same can be said if you match Nb.

Some samples that fit the Na signature at 464 = (12,12,15,15,15,16) come out at high step using more markers. Similarly, some samples that fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more markers. You can see this at a glance in those two files. Two opposite simple explanations come to mind: Na and Nb may have independently arisen more than once, followed by population expansion - multiple branches in the N tree. The opposite explanation: Na and Nb sets might be signatures for two old clades that each have a few old subclades - two main N branches that have a few old branches and where both Na and Nb have a bushy clump of branches at the ends. More complicated explanations also come to mind. That second explanation, two main branches, is attractive, but I see no proof that it is true, or even highly likely.

In the file NclusterAssignments.xls, I make speculative assignments. Most of the Na45 and Nb32 samples fit other more believable types and clusters. I went ahead and assigned the few leftovers to Na and Nb, but these are just speculative assignments, meant to show you which of my clusters you best fit.

Summary: There is not enough evidence to consider Na and Nb to be two unique subclades of N. Maybe Na45 and Nb32 do correspond to the top of two main branches of the N tree, with most of the samples that fit Na45 or Nb32 belonging to the corresponding clades. Maybe not. I see no way of ruling out multiple independent clades (branches far apart in the tree) for both Na45 and Nb32, or for any other definitions based on the 464 set. Perhaps in a year or so more STR data will provide convincing subclades along these lines. Perhaps in a few years SNPs will be discovered to subdivide N type.

At all 67 standard markers, the Na and Nb modal haplotypes are essentially the same for STR markers other than 464. I say “essentially” because the rapid mutators, particularly the CDY pair and DYS576, typically vary modally from month to month due to the statistics of small samples. At CDYb, Na type signatures with multiple markers are typically modal 40, while Nb are typically modal 39, but this marker always ranks poorly for definitions because of the wide range of values. In Nb less than 1/3 of the samples typically have the modal value at CDYb.

The Russian site independently came up with this same haplotype distinction. Two modal haplotypes are available on Ysearch, from the Russians. Each uses 78 markers and each matches my Na and Nb types at 67 markers, including that 39 value for CDYb in Nb. Central European-1 Modal GTAVR corresponds to my Nb, using only 4 values, 464a-d. Central European-2 Modal 495M5 corresponds to my Na, using 6 values, 464a-f.

Nc. New topic 25 Sep 2011. Based on 5 Aug 2011 Polish Project data. Analysis files Nc32Cluster.xls

My Nc code is for the signature DYS19 = 15, compared to the modal value of 16. Similar to Na and Nb, my publication and previous versions of this web page proposed Nc as a tentative subdivision cluster of Nb. Last year the samples with the 15 value were mostly Nb, but this year that correlation is not significant.

My opinion of Nc is very similar to my opinion of Na vs Nb: No confident conclusion. Nc might correspond to a single large clade. Then again, Nc might correspond to independent unrelated clades where the Nc mutation arose independently.

My Nc analysis complements my Na and Nb analysis: If you look at Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na and Nb. If you look at Na45Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16. If you look at Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc and modal DYS19=16. If Nc32 vs modal 16 is a valid division of N type, then Na vs Nb cannot be valid. If Na vs Nb is valid, Nc vs modal 16 cannot be valid. All three files have, at the bottom, at large step, some Na, Nb, and Nc samples.

Nbc42Cluster.xls is my analysis file using both the Nb and Nc signatures together.

Nac32Cluster.xls is my analysis file using both the Na and Nc signatures together. This is very different than Nc32; the latter has a mix of Na and Nc; the former is a new analysis using the additional restriction to Na match. They both have 32 markers by coincidence. As in Na45 and Nb32, the number of markers is my arbitrary choice; there is no obvious best choice; the number of markers will likely change as data accumulates for all these definitions where I specify the number of markers in the code name.

Nb5_37Cluster.xls is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers plus the modal value at DYS19.

Na7_26Cluster.xls is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers plus the modal value at DYS19.

In the file NclusterAssignments.xls, I make speculative assignments to these 4 clusters, but samples that fit one of the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more confident assignment if they also fit these 4 combinations.

The 3 Ng samples are all Na, but they are a mix of values at DYS19. The neighborhood (just beyond the Ng cutoff) is all Na. This is a tantalizing hint of a “father” clade with the Na signature.

The 4 N-Ashk samples are all Nb, but in this case the neighborhood is a mix of Na and Nb. This is a hint of an independent mutation to Na somewhat older than N-Ashk. Three of the 4 N-Ashk are Nc, as are most of the neighborhood. The other has the modal DYS19=16 value. This is a hint of a father clade with the Nc signature, DYS19=15, plus recent back mutations to the modal value.

The 6 Ns samples are all Na, with a neighborhood mostly Na but some Nb. The Ng, N-Ashk, and Ns samples are all very far from each other. You can see this in the file NclusterAssignments.xls, where each type and cluster has a column, with the step value for each sample. I consider this strong evidence against a large Na clade; it seems more likely that the Na (464=12,12,15,15,15,16) set arose independently by copy mutation 3 times in these three hypothetical clades.

Nt, the purported father of Ns, has 17 samples; 13 Na signature, 2 Nb, 2 one step away from Na. It is reasonable to speculate that those 2 Nb are due to an independent recLOH in Nt, and that the father clade has the Na signature. Unfortunately, it is also reasonable to speculate that there were multiple mutations to the Na signature within Nt, making the 464 set irrelevant.

The 3 Nd samples match Nb but again the immediate neighborhood is a mix of Na and Nb, again evidence for independent mutations at 464.

Ne is another example of a mixed Na Nb neighborhood. In this example, 2 of the 3 match Na. That third one, 464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a hint at another copy mutation.

NYsearch.xls has a sheet with Ysearch data analysis for each type or cluster. The Polish percent, in boldface, is my important result. Although this analysis is based on very little data for each of those 4 combination clusters here is the tentative finding: Nbc42 is not concentrated in Poland. The other 3 seem to be concentrated in Poland; that is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors one or more clades that are concentrated in Poland.

Ns seems related to Nac7_26, because 4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3, so this technique of 4-way combination is good for hints, but not conclusive.

Summary; Na, Nb, and Nc clusters: 25 Sep 2011. That was a lot of analysis to justify my opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without correlation to more markers. N type probably experienced population expansion not long after the TMRCA, whereby the main N branches come out today with similar STR distributions. DYS464 is multimodal; DYS19 is bimodal; the 4 main combination modes based on 464 and 19 provide evidence of twigs that are concentrated in Poland. I bet there are many more small Polish clades based on Na, Nb, and Nc waiting to be discovered in N type. I’ll continue to watch the STR data. New SNP markers within N type someday will be even better.

P. Complete rewrite finished 16 Aug 2011. Based on 5 Aug 2011 Polish Project data. Analysis file: PType.xls

P type is the main topic in my publication, Part II. P type is significantly concentrated in Poland, and in the Czech Republic. It is found at lower frequency in other Eastern European countries, and in eastern Germany. About 9% of Polish males carry P type Y-DNA.

After my publication, an SNP called L260 was discovered, found to be equivalent to P type, confirming my prediction that P type corresponds to a haplogroup, R1a1a1g2.

The “father” haplogroup R1a1a1g (M458) is composed of what I have been calling N type (L260-) and P type (L260+).

P type age (age means TMRCA) is about 1,600 years. That’s highly uncertain, but I’m 80% confident that age of 1,600 years is not off by more than a factor of 1.5 - age 1,100 to 2,400 years. The L260 mutation is likely quite a bit older than the age of P type.

It’s interesting to wonder if the age of P type is associated with the historical appearance of Poland somewhat more than 1,000 years ago. It’s also interesting to wonder why P type is so isolated in haplospace - why there are so few men alive today with STR values slightly different than P type. I added a bit of speculation along these lines to my publication, but frankly, no one knows the answers. I offer a little more speculation at the end of this topic.

My published 2009 definition for P type, P36, still works very well. My prior update definition, Sep 2010, P46, still works very well. I updated the definition Aug 2011; P43. All 3 definitions are compared in that analysis file PType.xls, Calculator sheet, columns BZ to CB.

The August change is only a slight tweak; I dropped 3 slowly mutating markers that are mutated in two samples recently found L260+; these two were at steps 7 and 8 using the prior P46 definition; they are now at steps 5 and 6 with the new P43. More discussion about this below.

There is only one L260+ sample not captured by P43. This sample is at step 9 using any of my 3 definitions. The problem is DYS464, where this sample obviously had a serious recLOH mutation, expanding the number of 464 markers from 4 to 6, yielding step 4 for only that compound marker. The net step 9 would become step 5 without 464. Nevertheless, I cannot drop 464 from my definition, because this marker helps a lot to discriminate P type from N type. I have more discussion below about this outlier sample.

P43 captures only one sample not P type, an NB sample, which means N Borderline. Although this sample fits N better than P, hence the NB prediction, it has not been tested for L260 or M458, so its status is uncertain.

There are 10 samples at step 6 (5 Aug 2011), the last step of the type, where uncertainty is highest. Seven of these have been tested L260+, confirming membership in this haplogroup. This high testing rate is not a coincidence; Mayka and I have been encouraging men with marginal samples to do the L260 test since it became available in Apr 2010. (We paid if cost was a problem.) One of the step 6 samples not L260 tested is the NB sample of the previous paragraph. Another is M458+ and not a fit for N type, so it can be confidently predicted L260+ (although the L260 test would be nice). The 10th step 6 sample has neither SNP test, and is not a fit for N type, so it is assigned PB, a Borderline assignment intended to encourage SNP testing. There are two other PB samples that were step 6 using the prior definition; these are now step 5. We will probably expand the PB category, so the next assignment update should have a few more PB samples, again to highlight the ones most likely to benefit from SNP testing. I estimate the PB samples have about 75% probability of being proven L260+.

P43 summary: The P43 definition, cutoff 7, captures 90 samples as P type. One L260+ sample is not captured because of DYS464. One captured sample at step 6 is probably N type. So the predicted P type is 90 samples and the predicted (some actual) L260+ is also 90 samples (5 Aug 2011).
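
A minimal sketch of how a definition with a cutoff can be applied to a sample. It assumes that "step" is the sum of absolute repeat-count differences over the markers in the definition, and that a sample is captured when its step is below the cutoff (so step 6 is the last step of the type at cutoff 7). The analysis files may count compound markers such as 464 differently. The three-marker definition below is a toy: only the 385a=10 value comes from this page; the other marker names and values are hypothetical, and this is not the P43 definition.

```python
def step(sample, definition):
    """Sum of absolute differences from the definition, over its markers only."""
    return sum(abs(sample[marker] - value) for marker, value in definition.items())


def is_captured(sample, definition, cutoff):
    """A sample belongs to the type when its step is below the cutoff."""
    return step(sample, definition) < cutoff


# Toy definition: DYS385a=10 is from the text; markerX and markerY are hypothetical.
toy_definition = {"DYS385a": 10, "markerX": 13, "markerY": 30}

exact_match = {"DYS385a": 10, "markerX": 13, "markerY": 30}
near_match = {"DYS385a": 10, "markerX": 14, "markerY": 28}
far_sample = {"DYS385a": 11, "markerX": 16, "markerY": 33}

for name, sample in [("exact", exact_match), ("near", near_match), ("far", far_sample)]:
    print(name, step(sample, toy_definition), is_captured(sample, toy_definition, cutoff=7))
```

With a real definition the same loop would run over all 43 markers; markers outside the definition are simply ignored, which is the point of using a definition shorter than the full 67.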

The statistical accuracy of my P type definition may seem like about 98% - 100% below step 6. However, my confidence is more like 90% - I’m 90% confident that more than 90% of future samples that match P43 below the cutoff step 7 will be L260+ if tested, with 95% confidence below step 6. That confidence is not calculated - it’s my estimate to account for two issues: First, I have removed from the definition markers that are mutated only for L260+ samples at high step (mentioned above and discussed further below) but more such mutated markers are bound to show up for future samples, so future predictions are not quite as good as the adjusted fit implies. Second, there may still be a very small L260- clade that just happens to have STR values close to P43 due to the luck of random STR mutations. For samples without Polish ancestry the probability is higher for these two issues; this confidence discussion is limited to Poland, as represented by the Polish Project.

According to Pawlowski, along with further evidence in my publication, P type (L260+) is concentrated in Poland. I verified this and other Polish types using both Yhrd and Ysearch. P has fewer mutations than N and K, so it must be younger. In my publication I estimated that about 8% of Polish men have male line ancestry of this type; my current estimate, from the Results Table, is 9.0% (calculated from the edited data 28 Jul 2011) -- calculated 70% confidence interval 8.0% to 10.0% -- 95% confidence interval 7.1% to 11.0%.
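
A minimal sketch of one common way to put a confidence interval on a frequency like this, using the normal approximation to the binomial. This is an assumption about method: the page does not say how the quoted intervals were calculated, and the edited project size is not stated here. The total of 1,000 below is hypothetical, chosen only so that 90 samples give 9%, so the output is not expected to reproduce the quoted intervals exactly.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(hits, total, confidence):
    """Two-sided normal-approximation interval for a proportion hits/total."""
    p = hits / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


hypothetical_total = 1000  # assumption; not the actual project size
for confidence in (0.70, 0.95):
    low, high = proportion_interval(90, hypothetical_total, confidence)
    print(f"{confidence:.0%} interval: {low:.1%} to {high:.1%}")
```

For small counts an exact or Wilson interval would be a safer choice than the normal approximation.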

Ludvik Urban pointed out to me that P type is common in the Czech Y-DNA Database. FTDNA also has a Czech Y-DNA Project. There is not enough data yet to calculate if the frequency in the Czech Republic is greater or smaller than the approximate 9% frequency in Poland (as represented by the respective projects).

Karen Melis, administrator of the FTDNA Zamagurie Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on the border of Slovakia with Poland. I’m not sure of the concentration in Slovakia.

It will be interesting if more data in the future allows resolution of subtypes of P type by region.

I added a “Ysearch” sheet to that PType.xls analysis file, with update analysis from Ysearch. That file has a copy of the 123 matches at step < 9 (12 Aug 2011) from my P43 definition, 8U92G. Seven of those matches are modals, segregated to the bottom of the sheet and not used for analysis. The cutoff is 7, same as in the Polish Project, but SBP is 19%, not very good. The reason is 10 samples at step 7. Only two of these at 7 indicate “Poland” for Origin, 3 Germany, 2 Scotland, 2 Unknown, and 1 USA. This may be a sign of a clade outside Poland with STR values close to the P type cutoff; I doubt that; more likely, these are outliers from more distant clades, because there are a huge number of samples at step >9 so of course some samples from those clades will fall at step 7 just due to the luck of random mutations. In other words, P type is a relatively small haplogroup on Ysearch, and the background is larger on Ysearch than in the Polish Project, so of course SBP will be larger. Still, 19% is pretty good on Ysearch.

Those Ysearch results include 11 samples with “Unknown” or “USA” for Origin, so I removed those for Origin analysis, 105 net samples. Below the cutoff step 7, 54% are Poland; that is very high; the overall percent of samples in Ysearch from Poland is a very low percent. At steps 7 and 8, 26% are Poland, showing the expected drop off for outliers. Germany and other Slavic countries also have significant percent P type; there is a table with details in that Excel sheet. This updates my evidence that P type (L260+) is concentrated in Poland.

The isolation of P type in the Polish Project is now even more impressive than at the time of my publication. Most of the samples at steps 7 and 8 are good fits to other newly discovered types (see PType.xls, column CB), so there are now fewer borderline samples just beyond the edge of P type. Two of the step 7 samples are my maternal cousins; their close match to P type is what got me interested in this topic; if I had not noticed this someone else may have done a similar study and those two samples would not be in the database; statistically those two should be edited; I edited by -1 in the Results Table, but I do not do minor edits in the analysis files. One of those cousins is tested M458- so I have high confidence both belong to I type, not P type.

This Aug 2011 analysis does not include L260 data from other projects. I’ll wait a few months before reviewing L260 data outside the Polish Project. My last analysis including data from outside the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following topics, which have not been updated for several months:

P type Age - TMRCA: My publication explains the ASD method. The ASD sheet in PType.xls provides 1,778 years using all 67 markers. However, 385b should not be used because 5 samples have recLOH mutation from 14 to 10, providing the unreasonable ASD age of 11,007 years at this one marker. Also, 464 has obvious recLOH issues; my ASD sheet, treating 464a to d as independent markers, comes up with an average of 2,093 years for these 4. Most people who figure ASD age exclude 464. It is interesting that 385a has no recLOH (10 to 14) so far; I do not understand why not. The other compound markers are not issues because the P type values are such that the apparent recLOH cause only step 1 mutations, so they might as well be included.

1,637 years is the ASD age, cell N29 of the ASD sheet, using 62 markers; excluding 385b and excluding the four 464. Exclusion is by typing a blank or zero into a mask, row 21, so you the reader can easily verify that removing compound markers other than 385b has no significant effect.
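
A minimal sketch of an ASD age calculation with a marker mask, to show why a single recLOH marker can dominate the result. It assumes the usual form of the method: for each marker, the average squared distance of the samples from the ancestral (modal) value, divided by that marker's mutation rate, times years per generation; the per-marker ages are then averaged over the unmasked markers. Whether the ASD sheet averages per-marker ages in exactly this way is an assumption. The mutation rates, the generation length, the plain marker names, and the four toy samples are all hypothetical.

```python
from statistics import mean


def marker_age(values, ancestral, rate, years_per_generation):
    """Apparent age in years from one marker: ASD / rate * generation length."""
    asd = mean((v - ancestral) ** 2 for v in values)
    return asd / rate * years_per_generation


def asd_age(samples, ancestral, rates, mask, years_per_generation=25.0):
    """Average the per-marker ages over markers whose mask entry is nonzero."""
    ages = [
        marker_age([s[m] for s in samples], ancestral[m], rates[m], years_per_generation)
        for m in ancestral
        if mask.get(m, 1)  # a zero in the mask excludes the marker
    ]
    return mean(ages)


# Toy data: one sample carries a 14 -> 10 recLOH jump at DYS385b.
ancestral = {"markerA": 13, "markerB": 30, "DYS385b": 14}
rates = {"markerA": 0.002, "markerB": 0.002, "DYS385b": 0.002}  # hypothetical rates
samples = [
    {"markerA": 13, "markerB": 30, "DYS385b": 14},
    {"markerA": 13, "markerB": 31, "DYS385b": 14},
    {"markerA": 14, "markerB": 30, "DYS385b": 10},
    {"markerA": 13, "markerB": 30, "DYS385b": 14},
]

print(asd_age(samples, ancestral, rates, mask={}))              # all markers
print(asd_age(samples, ancestral, rates, mask={"DYS385b": 0}))  # 385b excluded
```

Because the distance is squared, one 4-step recLOH jump counts as much as 16 single-step mutations, which is why a marker like 385b inflates the apparent age so strongly.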

The far right of the ASD sheet has all the markers ranked by apparent age. I added a Notes column with explanations for some of them. Other than 385b, other old markers should not be excluded because the random luck of STR mutations is bound to produce such anomalies, which are statistically balanced by the 9 markers with zero age (no mutations among the 90 samples). They should all average out. By the way, the number of markers with apparent zero age has been declining in P type as data accumulated during the past few years, as of course it should, but apparent age averaging many markers has not changed more than statistically expected due to the details of new data. My 2009 published age was 1601 years; my update last year on this web page was 1775 years. I have consistently written “roughly 1600 years” in my discussions.

There are a number of reasons why “raw” ASD age should be increased, as discussed in my publication, part I. However, those reasons are mostly due to population bottlenecks in the past. As discussed below, P type evidently went through a rapid population expansion soon after the TMRCA, so the raw ASD age should be used as a best estimate. Anyway, there are significant non-statistical age caveats that produce systematic uncertainties as large as the uncertainties due to population bottlenecks, and much larger than the statistical sampling uncertainties from 90 samples. So any age calculated from ASD (or from any other type of STR variance) should be taken with a grain of salt. My factor of 1.5 uncertainty quoted above is based on my 80% confidence from experience, not from calculation.

385a=10 is the best marker for P type. I have a separate topic for the P type signature. 385a=10 continues to be amazing. 89 of the 90 samples predicted L260+ have the 385a=10 value. Beyond P type, 385a=10 shows up in only 2 samples at step 7 (my two cousins, mentioned above, who should not both be counted), none at step 8, only 1 at step 9, and 3 at step 11. The PType.xls database is truncated at step <12; the full R1a data from the Polish Project - 457 samples - has only 1 more 385a=10 sample beyond step 11. In other words, this one marker 385a=10 is about 99% effective at capturing P type (future L260+ predictions) plus less than 3% additional falsely predicted foreign samples from the rest of R1a. 385a=11 is ancestral (N type and most of R1a), but so far there are no P type with the ancestral 11 value, strong evidence that the rare mutation from 11 to 10 happened before the TMRCA. The 385a & b pair are ranked together tied for 41st in the Chandler rates, not very slow. However, shorter STRs mutate a lot more slowly than longer ones, and step down is slower than step up with stronger effect for shorter STRs. (Chandler discussed this with me by email - his project did not take these issues into consideration - treating compound markers together, with data combined from all haplogroups). In other haplogroups 385a values >14 are not uncommon. So it makes sense that the 385a mutation 11 to 10 should be very rare, explaining why it works so well for P type, although that one P type exception (at step 4) is an even rarer 10 to 9 mutation.
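
A minimal sketch that tallies the 385a=10 counts quoted in the paragraph above. The counts come from the text; no step 10 samples are mentioned, so none are counted. The choice of denominator for the false-prediction share (the R1a samples outside P type) is an assumption, since the paragraph does not say which denominator is meant.

```python
p_type_total = 90      # samples predicted L260+
p_type_with_10 = 89    # of those, samples with 385a=10
# 385a=10 outside P type: step 7, step 8, step 9, step 11, beyond step 11
outside_with_10 = 2 + 0 + 1 + 3 + 1
r1a_total = 457        # full R1a data from the Polish Project

capture_rate = p_type_with_10 / p_type_total
outside_share = outside_with_10 / (r1a_total - p_type_total)  # assumed denominator

print(f"captured P type: {capture_rate:.1%}")
print(f"R1a samples outside P type with 385a=10: {outside_with_10} ({outside_share:.1%})")
```

Counting the two cousins at step 7 as one sample, as suggested above, would lower the outside count by one.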

Column CJ of my analysis file shows that using only the best 5 signature markers, cutoff 2, 83 P type samples are captured and none from outside P. That’s better than 80% accuracy using only 5 markers, which is very good and unusual in SNP prediction. Even more unusual is that the one best marker is even better.

DYS540=11. A new signature marker. From the 111 marker STR set recently available commercially. 71 Polish Project R1a samples already have the 111 data, including 12 P type and 12 N type (18 Jul 2011). 11 of the 12 P type have the 540=11 value. 11 of the 12 N type have 540=12. Since P and N are the two parts of the R1a1a1g (M458) haplogroup, this marker nicely distinguishes the two parts with high probability. 12 is obviously ancestral because that value dominates the R1a data. 540 already does not look as good as 385a for P type, but it’s always nice to have another signature marker. It is too early to switch definitions to the full 111 set. I’ll be adding 111 modal haplotypes to my Haplotypes.xls file over the following months; P and N are already there.

That Excel analysis file is intended for finding types - hypothetical haplogroups with < 20% SBP. For P type this is moot because L260 is available. Nevertheless, I used the file to automatically come up with the best prediction, P54, column CF, with SBP 7.6%. That SBP means 80% confidence (if L260 were not known) that less than 7.6% of the predicted P type would not actually belong to the predicted haplogroup. Indeed P54 captures 89 samples, only 3 of which are not P according to my new P43 based on L260 - that’s 3.3% foreigners captured. Since I published the SBP method in 2009, almost all predictions have been better than SBP. But I designed SBP to be conservative (higher percent) to account for statistical biases. I expect eventually to have a few failed predictions (foreign background larger than SBP, or two or more unrelated haplogroups fitting one type definition).

The main point of that PType.xls file: Many definitions are displayed, with various marker selections. I tried a lot more definitions than the ones displayed in that file. The exact definition does not matter much for P type. Any reasonable definition of P type captures more than 90% of P type and less than 10% foreigners. Even the full 67 modal haplotype works OK. Although that P54 has lower SBP than my current P43 definition (9.2%), P43 is better because I adjusted P43 using L260 results.

I identified P type and submitted my analysis for publication before the M458 mutation was announced by Underhill. The end of my Part I mentions M458 -- notes added during publication. M458 (so far) is composed of P type plus N type plus perhaps a few small clades just outside N. L260, the SNP that defines the haplogroup corresponding to what I have been calling P type, was discovered by a P type member of the Polish Project, inspired by my publication. With him and other coauthors, I published a brief letter announcing and describing L260 in the Fall 2009 issue of www.jogg.info.

P type has obvious structure. Evidence of sub clades. Nodes in the P type branch of the Y-DNA tree. The most obvious evidence is bimodal markers. The bimodal markers are discussed below as clusters - hypothetical sub clades without high confidence. The bimodal markers do not correlate with each other, so none of the clusters qualify yet as types. Future data may provide better statistics with a convincing subtype of P. If this paragraph is not clear, please read the discussion below for the individual clusters: Pa, Pc, etc.

Other evidence of structure: My two edits of the P type definition. In Sep 2010, I increased the number of STR markers in the definition, then edited out the markers that have mutations only in L260+ samples at high step, and not in L260- samples at or just beyond the cutoff. In Aug 2011, I edited out 3 more such markers. Four samples involved, color coded in columns BZ to CA in the analysis file; two do not fit my original P36 but fit the other two definitions; two do not fit the 2010 P46 but fit the other two definitions. These edited markers are also evidence of structure. These are all relatively slow mutating markers. Those samples with such mutations are probably from old nodes in the P branch. Of course, these cannot all be old nodes because some markers will have mutations only at high P step just due to the luck of random mutations. Some samples from young nodes will come out at high step due to luck, and some samples from old nodes will come out with low step. The point of this paragraph is that old nodes defined by rare mutations are expected in any Y-DNA tree, and those samples are evidence of the expected structure in P type. Another point of this paragraph is justification for my method of editing markers. You the reader may be concerned by such editing as selection bias to improve the apparent fit of the data. Indeed there must be such bias in some of the markers that I edited. However, insofar as some of those edited markers truly correspond to old nodes in the P branch, it is appropriate to edit them; future distant cousins with the same rare mutation will be better predicted as L260+. The whole point of using definitions shorter than the full 67 is to remove those markers that define sub clades in order to come up with a proper definition that distinguishes the branch as a whole, as explained in my publication.

Old node comment. It is possible the P type data includes samples that really belong to an L260 branch with a node much older than the next youngest node. In such a case it would not be proper to combine them into the single P type. That one sample at step 9 (discussed above) is an example of a candidate for such an old branch, but then again that sample might just be an unlucky member of a young node (an outlier). Those 4 edited samples of the previous paragraph are also examples. Because there have been very few P type samples beyond my original cutoff, and because all but one of them were easily incorporated with minor edit of the definition, I am comfortable considering them all as a single type until there is evidence of significant L260+ samples beyond P. At any rate, all markers are included in the age calculation, so any old branches contribute to the estimated age of the oldest node (oldest node means MRCA). This paragraph would be a valid comment about any type analysis, but P type is unusually well isolated in haplospace, so the justification is strong to consider it a single clade.

The L260 mutation might be about the same age as P type. Unlikely. We expect a defining SNP to be more likely older than the TMRCA, perhaps much older.

The Western Slavic Modal haplotype, Ysearch 28WGP, matches P type perfectly at all 43 markers used in my new definition. That Western Slavic Modal uses 76 markers, but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.

The Polish Project makes some assignments to P type for samples with < 67 markers if they match the P type model very well. I have not updated those assignment rules for a couple years, but I have been quite conservative below 67, so those assignments are still > 80% confidence.

Let me finish this P type topic with brief speculation about the origin of P type:

What does P type isolation mean? One simple explanation: The M458 father haplogroup for P type and N type seems to have experienced a severe population bottleneck. The evidence: P type and N type are very easily separated by STR values. Both are isolated in haplospace. No overlap. They are so far apart that the nearest neighbors (just beyond the cutoff) for P type include outlier samples (from other R1a haplogroups) in addition to N type samples, and nearest neighbors for N include samples other than P. Apparently, the father haplogroup was quite old at the time of the bottleneck, with lots of variation in STR values. The bottleneck wiped out most of that population, so today men in that father haplogroup descend from just two ancestors, the MRCAs of P type and N type.

Why is P type so large and concentrated in Poland? One obvious explanation is a rapid population expansion not long after the TMRCA. Evidence: Subtypes cannot be defined with confidence. Apparently, the major bimodal markers are due to mutations that happened early in the population expansion, so the branches of P type have similar statistical spread of STR values. For more discussion along these lines see the discussions of the clusters below.

There are other explanations to these questions: P type may represent a huge migration of a single paternal tribe during the dark ages from far away to the region that is now Poland. Perhaps the related haplogroups in that far away place got wiped out by subsequent famines and wars. Or maybe they did not get wiped out. If people in that far away place did not tend to migrate to North America in the past, and today do not tend to get DNA tests, then perhaps there are isolated pockets of L260 clades there waiting to be discovered - some with STRs very similar to P type - some with STRs very different than either P or N. Maybe in the mountains of western Asia.

Also, the standard “null” explanation should be considered unless there is strong evidence otherwise. The null explanation is statistical: No significant bottleneck or expansion. Just the luck of random growth of clades in a small human population over the millennia. The MRCA of P & N perhaps were far apart in STR values just by luck - both being outliers. No one knows how to calculate the probability that a large P and a larger N clade can be sole survivors of the statistics of clade growth in the Y-DNA tree in only a couple thousand years. To me it seems highly unlikely. But I don’t know how to rule this null model out in a convincing way.

I can think of more complicated models as explanations. I’m sure you can, too.

Caveat: I said M458 consists of P and N. It is possible some of the outliers from N type might represent small old branches that have nodes older than the node for P & N. There is no evidence to support this, but then again there is no evidence to rule this out with confidence. More data will answer this over the next year, perhaps. Anyway, this is a small detail in the larger picture.

P type Bimodal Markers. This sub topic was significantly edited 25 Aug 2011, when I introduced a definition of bimodal.

The following analysis uses the 90 P type samples (5 Aug 2011) predicted L260+, at 67 markers, discussed above. I also include some comments about the 12 samples available with 111 markers (on 18 Jul 2011). A bimodal marker is evidence of structure, but not proof - a hypothetical clade.

In the past, I have sometimes called these hypothetical types. I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence that more than 80% of the samples belong to a clade that will someday be confirmed as a haplogroup by a newly discovered SNP. Sometimes we make exceptions slightly above 20%, for example when a type is regionally concentrated.

None of the following bimodal markers qualify as a definition of a type, although some of them might be good enough to be called clusters.

This is not proof that a specific bimodal marker or cluster does not correspond to a future haplogroup. It is still possible that 95% of the samples from a particular bimodal marker belong to a unique future haplogroup. For example, if the son (or grandson, or great great grandson) of the P type MRCA had that defining mutation, and if he participated in the purported P type population expansion, that would explain why his haplogroup (male descendants) has STR values so similar to P type except at the one defining marker. He had no other mutations that differed from his ancestor among the standard 67 that I’m using today for analysis.

It is possible as more STR data accumulates some of the following will qualify as types. Cluster identification is a bit of an art so it is possible I just failed to find a small P sub type and someone else will find it.

Many of the following are probably not unique clades, but instead represent two clades that have widely separated nodes in the P tree. Or three or more.

One characteristic of a type: It shows up early in the data as a cluster with 20% < SBP < 50%, and the SBP continuously decreases in value as more data shows up, as the SBP penalty for sampling statistics becomes diluted. This is good - it means false clusters that show up by luck will not last as more data accumulates. The P bimodal markers that I have been following for a few years (Pa, Pb, Pc, Pd, Pe, Pg) all have increased in SBP, which I take as evidence that they will probably not become types.

Excel files for Pc and Pg are in the on line data with my 2009 publication; I am not updating those or adding any others because none are good enough to stand out. Nevertheless, some merit discussion:

Pa Bimodal Marker. Defined by DYS389 delta = 18. DYS389=13,31. 18 samples (among 90 P at 67). P modal values are 13,30. This is a compound marker; that 2nd number is the sum, so this mutation is in the longer repeat chain; P modal 17, Pa value 18. All the 18’s are 13,31; there are no 14,32 or 12,30 in the Polish Project P type data at this time; my analysis files will capture any future such samples as Pa. That 31 value by itself does not capture the Pa cluster because there are several 14,31 in P type, which I’m calling a different cluster because they are not mutated at the longer repeat chain; the 14 refers to the shorter chain.
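The DYS389 arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `dys389_delta` and `is_pa` are hypothetical, and the test pairs are the ones quoted in the paragraph above.

```python
def dys389_delta(dys389_i: int, dys389_ii: int) -> int:
    """Length of the longer repeat chain.

    The second reported DYS389 value is the sum of both chains,
    so the longer chain is the second value minus the first.
    """
    return dys389_ii - dys389_i


def is_pa(dys389_i: int, dys389_ii: int) -> bool:
    """Pa is defined by delta = 18 (P modal delta is 17)."""
    return dys389_delta(dys389_i, dys389_ii) == 18


# Pairs mentioned in the text: P modal, Pa, and the 14,31 samples
# that share the 31 but are not mutated in the longer chain.
for pair in [(13, 30), (13, 31), (14, 31), (14, 32), (12, 30)]:
    print(pair, dys389_delta(*pair), is_pa(*pair))
```

Checking the delta rather than the raw 31 value is what keeps the 14,31 samples out of Pa while still capturing any future 14,32 or 12,30 samples.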

Pa is briefly mentioned in my publication at page 172. Pa was the first bimodal marker to catch my attention in 2007 because that 31 value produces the 3rd most common haplotype in Polish data that differs by only one step from P modal values using the old standard 12 marker set; see the table in my publication at page 162. Such a common haplotype at 12 is evidence that Pa is an old sub clade of P. However, the evidence is not convincing yet.

Bimodal evidence: Only 4 samples (value 16) with values other than 17 or 18 for the longer chain.
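A tally like this one comes from a simple count of the values at one marker. The sketch below is an illustration only: it does not reproduce the definition of bimodal used on this page, the helper name `top_two_summary` is hypothetical, and the value list is constructed to match the counts quoted above for the longer DYS389 chain (18 samples at 18, 4 at 16, and the remaining 68 of 90 at 17).

```python
from collections import Counter


def top_two_summary(values):
    """Return the two most common values with counts, plus the count of all others."""
    ranked = Counter(values).most_common()
    top_two = ranked[:2]
    others = sum(count for _, count in ranked[2:])
    return top_two, others


# Constructed from the counts in the text, not copied from the project data.
longer_chain = [17] * 68 + [18] * 18 + [16] * 4

top_two, others = top_two_summary(longer_chain)
print("two most common values:", top_two)
print("samples at any other value:", others)
```

A small "others" count next to two well-populated values is the pattern described above as bimodal evidence.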

Pb Bimodal Marker. DYS19=16. 27 samples. P modal value 17. This one is of interest because 16 is the ancestral R1a value, modal for both N and K types. The large size of Pb is a bit of a surprise, because Pb is only 5th largest at 12 markers, and those should be a mix of P and K because Pb differs from both P and K by only 1 marker out of the 12. Those 27 are not K because they have 67 markers and do not fit K type, which differs by multiple signature markers. The large size of Pb might mean there is one large P sub clade that represents the oldest P node, before the mutation to 17, so it is quite old with lots of STR variation. That makes sense, because the proportion of Pb samples that match the Pb modal at 12 markers is not much different than the proportion of P samples that match the P modal at 12.

On the other hand, Pb might be 2 or more clades with unrelated nodes; only one of those might be the oldest, the others being back mutations to 16 by coincidence. Alternatively, that 16 might be a back mutation for most or all samples, as far as we know with the data available today.

Pab bimodal marker pair would have both Pa and Pb defining mutations. There are only 2 such samples (out of 90 at 67 markers).

Pc Bimodal Marker. DYS439=11. 17 samples. P modal 10. Also discussed in my publication starting on page 171.

The combination markers produce Pac and Pbc clusters with 3 and 6 samples. See also Pch below.

I called this Pc type in my publication, but I have since restricted my use of “type” to those clusters in which I have 80% or more confidence.

Pc Cluster is the only significant subdivision so far of P type, although I do not have high confidence that I have accurately defined Pc. P type is the most significant Polish Y-DNA clade. In my analysis file PcCluster.xls I included a long and tedious Discussion sheet; that sheet is not really intended only for men assigned to Pc; it is intended more as a demonstration of cluster analysis techniques that I have developed, and that may be of interest to other STR cluster analysis enthusiasts.

Pg Bimodal Marker. DYS572=11. 25 samples. P modal 12. Also discussed in my publication page 172. Like Pb, this one is of interest because the 11 value is ancestral; the discussion is similar to the discussion for Pb.

Bimodal evidence: Only 2 samples (one each at 12 and 13) with values other than 11 or 12.

The combinations Pag and Pbg each have 8 samples. Two Pb combinations (above) have 3 or more samples. All other combinations of a, b, c, g have fewer than 3 samples each.

Those two combinations with 8 samples, Pag and Pbg, are instructive. They provide a reason why Pg has not worked as a proposed type in the past. Pg might be comprised of two sub clusters. Pag has the P modal 17 for all 8 samples at the “b marker”. Pbg has the P modal 17 for the long 389 chain for all 8 samples at the “a marker”. 9 Pg samples belong to neither Pag nor Pbg.

For most haplogroups, a cluster of 8 samples with two markers that differ from the haplogroup modal is impressive. However, P type is large and relatively homogeneous. In this case I have tried many combinations; some are bound to come up impressive just by luck; I am discussing only the impressive ones. I suppose if your sample falls into either Pag (or Pbg) there may be 50% confidence that you belong to a clade including more than 5 of those samples defined by the two corresponding mutations, but I personally do not consider the confidence anywhere near 80%.

Even if Pag and Pbg are shown in the future to correspond to two haplogroups, it does not follow that they will be sub clades of Pg; they may be independent branches of the P tree that both received the DYS572=11 mutation independently. Or one of them could be an old node with the ancestral value.

DYS572 is ranked in the Chandler list as 40th, not very slow. In the 2010 version of this web page, I presented evidence that 572 is indeed a slowly mutating marker, at least in R1a. I still stand by that prediction. That would make it reasonable that most of the Pg samples belong to the oldest node in the P tree (but still less than 80% confident for 80% of the samples). Also, we wonder if Pbg is the oldest node in the Pg branch, or if Pbg is a more recent back mutation at the “b marker” DYS19 to the ancestral value? In other words, are the apparently ancestral 572=11 and 19=16 both older than P type, or both younger, or is one older and one younger? We don’t know yet.

There are several combinations; the ones with 3 or more samples: Pah, Pbh, Pch, Pgh, Pagh, Pbch, Pbgh have 4, 11, 12, 14, 3, 5, 4 samples.

My published Pc can also be considered Pch, defined by those two markers that differ from the P modal.

The best 3: Pbh, Pch, Pgh, have 11, 12, 14 samples. These are instructive, particularly if they are viewed along with the previous two “instructive” combinations, Pag and Pbg above. These cannot all be valid clades because the same markers are used in different combinations. This is an explicit demonstration of how interesting clusters will always come up if enough combinations are tried. However, if we assume one particular cluster to be valid, that means some of the others are not valid.
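The point about trying many combinations can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of the project data: the markers are generated independently, so there are no real sub clades at all, and the sample count, marker count, off-modal probability, and seed are all hypothetical choices.

```python
import random
from itertools import combinations


def largest_pair_cluster(n_samples=90, n_markers=8, p_off_modal=0.2, seed=1):
    """Simulate independent markers and find the most populated two-marker combination."""
    rng = random.Random(seed)
    # True means the sample differs from the modal value at that marker.
    samples = [
        [rng.random() < p_off_modal for _ in range(n_markers)]
        for _ in range(n_samples)
    ]
    counts = {
        (a, b): sum(1 for s in samples if s[a] and s[b])
        for a, b in combinations(range(n_markers), 2)
    }
    best_pair = max(counts, key=counts.get)
    mean_count = sum(counts.values()) / len(counts)
    return best_pair, counts[best_pair], mean_count


best_pair, best_count, mean_count = largest_pair_cluster()
print("pairs tried:", len(list(combinations(range(8), 2))))
print("best pair:", best_pair, "samples:", best_count)
print("average pair size:", round(mean_count, 1))
```

The best pair is always at least as large as the average pair, and with enough pairs tried it is usually noticeably larger, even though the simulated data contain no structure.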

Pd, Pe, Pf, Pi, …. My Haplotypes.xls file, near the middle of the “Haplotypes” sheet, has a longer list of bimodal markers in P type.

Plap Cluster. Includes Lapinski samples. This cluster has 8 samples that match perfectly at 14 of the 67 markers. Two of those 8, plus two more at step 1 out of the 14, belong to the Lapinski family set. This is an example of selection bias, because Lapinski recruited the other 3 distant relatives, so the cluster is not as large as it seems. The cluster does not form a type; I mention it here as an example of a tentative cluster.

The Plap modal differs from the P modal at what I call the Pr marker, DYS607 = 17 for Plap vs 16 for P modal. DYS607 is highly variable in P type; there are more 17 samples than 15 samples -- a mildly bimodal distribution. However, those 8 Plap samples, all with the 17, just about account for the excess 17’s, so 607 is no longer bimodal after adjusting for Plap.

Pz Cluster. DYS565=14. Only 4 samples. DYS565 is the last of the 67 set. There are 5 DYS565=14 samples -- these 4 plus another that does not fit. The Pz modal differs from the P modal at 12 markers, so this one is promising for the future. SBP comes out over 50% because of the penalty for small sample statistical correction built into SBP. This one may improve as more data accumulates in a year or so. On the other hand, I studied about 20 P clusters to come up with this best example of a new promising cluster, so the most obvious explanation is luck. If you study STR data randomly generated by a computer you may find a good cluster if you examine enough candidates.

Z93. New topic 31 Oct 2011. This new SNP was recognized earlier this month by ISOGG as R1a1a1h.

So far, all Z93 samples in the Polish Project are coming out L342.2+, and vice versa.

A type, discussed here at this web page since origination, and mentioned in my 2009 publication, is a branch of Z93 (L342.2). A type samples are coming out positive for both SNPs.

I just today added L342T as a new cluster, a hypothetical branch of Z93 (L342.2).

The Z93 category at the Polish Project web page has the samples that are Z93+ or L342.2+ and are not predicted A type or L342T cluster. Z93 also includes samples not tested for Z93 but are close STR matches to a sample that tested Z93+.

I tried to come up with an STR definition for Z93 (L342.2). I could not. Z93 does not have good signature STR markers. Or, there is a better way to say that: The signature markers for Z93 are about the same as the signature markers for Z280 (previous topic), which is a large new haplogroup in R1a. Lots of Polish Project samples are now coming out Z280+. Z280 seems to be equivalent to what I have been calling K type. Z93 and K type have similar STR values at the slower mutating STRs. As a result, the modal haplotype for R1a as a whole is similar to the modal haplotype for Z93 (L342.2) samples, and similar to the modal haplotype for Z280 (K type) samples.

A simple explanation: Z280 and Z93 are “brother” haplogroups, and neither is particularly young. The MRCAs of these two haplogroups apparently had very similar STR values. Originally, both grew rapidly, before significant sub clades could form with STR mutations at slow mutating markers. Over the years, both haplogroups diversified in STR values. So many subclades in Z280 and Z93 today have overlapping STR values. Population bottlenecks eventually produced some sub clades with good STR signatures, such as A type for example, which is very well isolated in haplospace. This paragraph is a simple explanation of why it is difficult to distinguish all Z93 samples; other explanations are possible, including complicated explanations.

Z93 is a good example of why calculating age of haplogroups is highly uncertain. A type seems to be very young. A type dominates Z93 in the Polish Project. Maybe A type had a particularly vigorous population expansion; or maybe A type luckily avoided a severe population bottleneck; or maybe the A type ancestors moved to Central Europe from distant lands; whatever. Age is calculated from STR variance, so the age of Z93 is dominated by the age of A, which is misleading and too young. If A type samples are excluded, the age of Z93 still would come out too young, because the A type samples have a unique STR signature, which means significant STR mutations, which means the A type MRCA lived at a time when Z93 was already quite old, so the A data needs to be considered when estimating the age of Z93. I’ll try to come up with an age estimate, for next time I update this topic.

On 20 July 2010 I added the following three R1b Types to this web document (next three subtopics, L23EE, L47P, L47A).

Mayka had already added these three to the Polish Project web page during the previous week, based on my recommendation, based on my SBP analysis.

I independently found these three by analyzing the Polish Project R1b data, but Mayka pointed out they were previously known as clusters. We judge that my analysis justifies adding them to our list of types. Since I’m using 639 samples with 67 marker data as representative of Poland, a small type clade at 1% of the Polish population would be expected to have roughly 6 samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
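The sample-count estimate above can be checked with a simple binomial model. The sketch below is a minimal illustration, assuming each of the 639 samples independently belongs to the clade with probability 1%; the helper names are hypothetical, and this plain sampling interval is not necessarily the method behind the interval quoted above, so the bounds need not agree exactly.

```python
from math import comb


def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def central_interval(n, p, level=0.70):
    """Smallest counts whose cumulative probability reaches each tail bound."""
    tail = (1 - level) / 2
    lower = next(k for k in range(n + 1) if binomial_cdf(k, n, p) >= tail)
    upper = next(k for k in range(n + 1) if binomial_cdf(k, n, p) >= 1 - tail)
    return lower, upper


n_samples, frequency = 639, 0.01
expected = n_samples * frequency
low, high = central_interval(n_samples, frequency)
print("expected samples:", round(expected, 2))
print("central 70% interval:", low, "to", high)
```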

I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.

The STR definitions for these are available at Haplotypes.xls. PolishCladesUpdate has a link to an Excel analysis file for each of these three types.

Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be optimal for other regions. If you have Polish ancestors, and if you have all 67 markers, and if you match one of these within a step distance of 10 there is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability that you belong. You should test the appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern Europe and you are a marginal match (step distance 5 to 15) for one of these, it is not very probable that you belong to the corresponding Polish clade, because each of these types has some overlap with other clades that are rare in Poland.

L23EE. 20 Jul 2010 documentation: This type is positive for the L23 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only current known branch - R1b1b2a1 - of L23.

Nordtvedt pointed out the cluster for this type some years ago, calling it R1b-EE (Eastern Europe). Mayka suggested the L23EE code to me.

There are only 6 samples in the Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a small type. The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%. In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+. So this type is very well isolated in haplospace in Poland.

On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 12 are East European - one each in Germany and Russia. That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain. I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.

This type has evidence of structure. A number of markers are bimodal with no obvious correlation. To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.

If you match this type closely at 37 markers I highly recommend getting the full 67, because the statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51 test; a negative result confirms membership in this hypothetical clade, and a positive result means you are not a member. We do not know the probability of outsiders matching L23EE in STR values, particularly outside Poland, so there is still a slim chance of a surprise - a close match to the definition but with L51+.

L47P. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is a hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. This type is probably negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one sample so far so it is not certain.

Mayka announced the cluster corresponding to this type on the web in March 2009.

There are only 4 samples in the Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small type. The cutoff is 7 and the gap is 10. There are no samples from step 7 to 16. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.
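The cutoff and gap numbers quoted for these types can be read as follows: samples below the cutoff step are members, and the gap is the number of empty steps from the cutoff up to the nearest outsider. The sketch below assumes that reading, and assumes step distance is a plain sum of absolute repeat differences, which may differ from the handling of multi-copy markers in the analysis files. The function names and the list of distances are hypothetical; the three modal values are the P modal values quoted earlier on this page.

```python
def step_distance(haplotype, modal):
    """Sum of absolute repeat-count differences over the markers in the definition."""
    return sum(abs(haplotype[marker] - modal[marker]) for marker in modal)


def members_and_gap(distances, cutoff):
    """Split samples at the cutoff and count empty steps before the nearest outsider."""
    members = [d for d in distances if d < cutoff]
    outside = [d for d in distances if d >= cutoff]
    gap = min(outside) - cutoff if outside else None
    return members, gap


# Three-marker toy definition using P modal values from the text.
modal = {"DYS19": 17, "DYS439": 10, "DYS572": 12}
sample = {"DYS19": 16, "DYS439": 11, "DYS572": 12}
print("step distance:", step_distance(sample, modal))

# Hypothetical distances shaped like the L47P case: cutoff 7, nothing from 7 to 16.
distances = [0, 2, 3, 5, 17, 19, 24]
members, gap = members_and_gap(distances, cutoff=7)
print("members:", len(members), "gap:", gap)
```

With the nearest outsider at step 17 and a cutoff of 7, the empty steps are 7 through 16, which gives the gap of 10 described above.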

This type is very robust; the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.

Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.

Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project. Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.

Members of this type should test for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b, which is equivalent to P312, an “uncle” haplogroup, that is L47 negative. That means there may be some interference in STR matching, probably less than 10% in Polish data, but I do not know what the exact percent interference will be until more data accumulates.

L47A. 20 Jul 2010 documentation: This type is positive for the L47 SNP, hence this type is another hypothetical future haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative for L44, a known branch of L47.

Mayka suggested the “A” code, since this type is obviously Ashkenazi, based on family names (see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA projects, although I did not do the research to find a first web publication at 67 markers; I would appreciate an email of a reference to add here, even if it does not exactly match my definition. It’s OK if an international modal haplotype differs by a few markers from a haplotype determined in Poland, particularly if the difference is at markers that are bimodal, indicating subtype structure.

There are only 5 samples in the Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a small type. The cutoff is 10 and the gap is 9. There are no samples from steps 9 to 18. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.

This type is very robust; the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.

This type is better yet on Ysearch (code 7HB9C), with 18 samples (13 Jul 2010) for better statistics; SBP = 4.6%, which is remarkable. It might be even better with an optimized definition; I used the modal haplotype that I extracted from the 4 Polish Project samples.

This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.

So far, see ISOGG, L47 and L148 are the only two known branch haplogroups of L48. In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that do not match either L47P or L47A so belong to yet other clades. I also have not searched the web for the STR values expected for L148. (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.) All this will quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148. Mayka provided the SNP data that I have documented here.

At the end of July 2010 I added two types from the I haplogroup to this web document. I independently found these two by analyzing the Polish Project I data. Mayka informed me that they were previously known as clusters, hypothetical clades, discussed some time previously by Nordtvedt. Mayka added these two to the Polish Project web page in July 2010, based on my recommendation, based on my SBP analysis. One is a branch of what has previously been called I2-CE, and seems to represent a Polish collection of M223 branches so we named it M223CE type, discussed in the next topic. The other seems to be a Polish branch of I1-M253, so we named it M253P type, discussed in a topic below. I am now also using the short code names I-CE and I-P for these. I am now splitting I-CE into I-C, I-D, and I-E, topics below.

Instructions for Ysearch comparison are below. These types are calibrated to Polish Project data. The I-P definition WC8JD forms a type in the Ysearch database, so it seems to be reasonably valid world wide. The I-C definition SB6YK, and the I-E definition QUXE3, are probably not valid at Ysearch for a sample with origin remote from Historical Poland, because of interference by other clades with similar STR values, particularly from Russia.

I-CE. (M223). Update 25 Mar 2012. ISOGG code is now I2a2a; last year’s code for M223 was I2b1, still being used at FTDNA and the Polish Project.

All the I-CE samples in the Polish Project fall into one of the 3 branches discussed in the following topics.

The M223 clade is very well isolated in STR haplospace. FTDNA is able to predict I2b1(M223) with high confidence using only the first 12 standard markers, for more than 90% of the samples. Using 67 markers, I found that any reasonable definition does a good job of extracting M223 samples from Y-DNA STR data. A good definition is available on Ysearch, code 4H6C9, using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).

STR isolation in the Polish Project is generally evidence of a single Polish clade. It is possible that two or more clades with distant nodes in the Y-DNA tree might have similar STR values by coincidence. In the case of Polish I-CE, since the larger I-CE world-wide clade is well isolated, my Polish I-CE type might well be a collection of multiple clades, perhaps including some clades that are not particularly concentrated in Poland. My original M223CE type used 4 of the 8 I-CE samples back in 2010. There are now 12 I-CE samples, and they form two types plus one cluster. It may seem silly to split these into 3 branches, but there are new SNPs, discussed below, that justify the split as valid haplogroups. These small types are interesting because they are preliminary evidence of small Polish clades.

The M223-Y-Clan project has lots of data; I used this project data for reference.

A good signature is (392, 437, 450) = (12, 14, 9), which distinguishes almost all M223 samples from others, allowing one mutation step. (594) = (11) is also an excellent signature for M223, with the value 10 dominant outside M223, but this one is strange in the Polish Project, where 4 of the 12 samples have value >11; this is evidence that I-C might comprise two clades.
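A signature check of this kind can be sketched as below. This is a minimal illustration, assuming "allowing one mutation step" means the total step difference over the three signature markers is at most 1; the function name is hypothetical, and the example haplotypes are made up rather than taken from project data.

```python
# Signature values quoted above for M223, keyed by full marker names.
M223_SIGNATURE = {"DYS392": 12, "DYS437": 14, "DYS450": 9}


def matches_signature(haplotype, signature=M223_SIGNATURE, allowed_steps=1):
    """True if the haplotype is within allowed_steps of the signature in total."""
    steps = sum(abs(haplotype[marker] - value) for marker, value in signature.items())
    return steps <= allowed_steps


exact = {"DYS392": 12, "DYS437": 14, "DYS450": 9}
one_step = {"DYS392": 11, "DYS437": 14, "DYS450": 9}
three_steps = {"DYS392": 11, "DYS437": 15, "DYS450": 8}

for name, haplotype in [("exact", exact), ("one step", one_step), ("three steps", three_steps)]:
    print(name, matches_signature(haplotype))
```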

At Ysearch, the percent Polish samples for I-M223 is low. The following 3 STR definitions, my proposed Polish branches, capture a small fraction of M223 at Ysearch.

My Excel file I-CE.xls has analysis of this type and also analysis of the following three branches. That file has ASD analysis, but ASD age is very misleading when calculated from samples that are a collection from multiple large old clades. The three branches have too few samples to attempt age estimates.

I-C. (M223+ P78-). (I-C Type Branch). New topic 25 Mar 2012. I-C type is a hypothetical subdivision of I-CE (M223).

I-C type includes all 4 samples assigned to I-CE last year, plus one that was missed last year, plus 3 new ones, for 8 total at 67 markers in the Polish Project. SBP has improved from 19% to 2.6% over the past year, so this is a clade with high confidence due to the excellent isolation, although there is a chance it may be two or more independent clades as discussed above.

My Excel file I-CE.xls has analysis of this type in column CJ, SBP=2.6%. My definition uses 67 markers, cutoff 20, gap 14. There are no Polish Project samples in the gap from step 20 through 33, so this type is very well isolated. This definition also isolates I-E type, 4 samples, steps 34 to 42, but there is a better definition for I-E, see the next topic.

There are no Polish Project samples at step 43 or 44. There is only one I2b2 sample (not M223) at step 45. Then there are no further samples at steps 46 through 52. So this I-C definition also captures all of the broader I-CE (M223), although surely a better I-CE world wide definition could be constructed.

A good signature is (406, 487) = (10, 12), which itself distinguishes the 8 I-C samples in the Polish Project.

Two of the I-C samples are I-D samples, discussed below. Two other I-C samples have the same family name, very close in STR values. The remaining 4 samples in I-C are not particularly close to each other in STR values. The SNP data for each sample is included in column BX of the “Calculator sheet”; 4 of the samples tested negative for all 4 known haplogroup branches of I-M223. So I-C seems to capture M223* plus P95 (below) in the Polish Project.

My definition is also available at Ysearch, SB6YK. On Ysearch there are plenty of samples from step 20 through 33, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-C truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.

I-E. (M223+ P78+). (I-E Type Branch). New topic 25 Mar 2012. ISOGG now I2a2a3; last year’s code for P78 was I2b1c, still being used at FTDNA and the Polish Project.

My Excel file I-CE.xls has analysis of this type in column CM, SBP=13%. My definition uses 67 markers, cutoff 19, gap 7. There are no Polish Project samples in the gap from step 19 through 25, so this type is very well isolated. Only the I-C samples are all at steps 26 to 44, so this definition also nicely separates I-C from I-E in the Polish Project.

A good signature is (393, 459a, 446) = (15, 9, 10), allowing one mutation step, which distinguishes the four P78 samples in the Polish Project.

Both the P78+ Polish Project samples are in the M223-Y-Clan Project, and there are 13 others, but there are many more P78- in M223-Y-Clan, so this is not a particularly large subdivision of M223.

The other two I-E samples in the Polish Project have not been tested for SNPs, but both have P78+ close matches on Ysearch, and no close matches from the other 3 branches of M223, so those are likely also P78+.

There are two other known haplogroup branches of M223: M379 has no positives in M223-Y-Clan, and plenty of negatives, so it is very rare. M284 has plenty of positives in M223-Y-Clan; that branch is a large subdivision with a couple known branches of its own, but no samples in the Polish Project.

My I-E definition is also available at Ysearch, QUXE3. The Ysearch closest matches are I2b1c, so my definition is good at extracting P78 samples, but I suppose a better definition could be constructed for the world-wide P78 data. On Ysearch there are plenty of samples from step 19 through 25, including some I2b1c beyond step 25, so this definition does not work world-wide. The closest fits are not concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR definition will not find members with confidence far from the region of Historical Poland.

I-D. (M223+ P95+). (I-D Cluster). New topic 25 Mar 2012. ISOGG now I2a2a4; last year’s code for P95 was I2b1d.

There are only 3 samples P95+ in the M223-Y-Clan Project, and many P95-, so this is a small haplogroup. Those 3 include one but not both of the Polish Project I-D. Two of those 3 have Poland listed as origin, and the third has no origin listed, so this may be a Polish clade, but it is too soon to tell. It is possible that I-C has a larger subdivision Polish branch, of which this I-D may be a branch, but this is just speculation until we get more data.

(640) = (13) seems to be a signature for I-D, but one STR marker should not be very reliable for prediction.

I did not enter a definition into Ysearch. The two I-D samples are highlighted bold blue in column CI of I-CE.xls. Only one sample is P95+ in the Polish Project - the one that is also in the M223-Y-Clan Project, so I used that sample as the definition. There is a sample at step 10, and none others out to step 22, so I tentatively assigned that step 10 sample to I-D, forming a cluster of two samples, SBP=25%, well isolated from others but not a type.

On 26 July 2011, I added this Polish type for I1 haplogroup to this web page. This type has been known as a cluster for a few years. Mayka pointed out to me that Nordtvedt listed it on the web. Marek Skarbek Kozietulski has studied this cluster quite a bit, since he’s a member. I mentioned this type briefly in my publication, where I was previously calling it Y type, considering it not high confidence based on the data available then in 2009. I am now very confident that I-P type corresponds to a valid clade, concentrated in Poland, to be verified someday with a new SNP discovery.

I have also called this M253P type, because I-P samples test positive for I1 (M253) and negative for the branches of I1, although new SNPs for I1 are being rapidly discovered, and the newest have not yet been tested for I-P. So this is a type within the paragroup I1*, although a low fraction of samples from I1* are members of this M253P type. Marek has done the WTY, March 2011, without finding an SNP for I1.

My analysis file is I-PType.xls. My definition for I-P type uses 54 markers, cutoff 4, gap 5, no samples in the gap from steps 4 through 8 in the Polish Project at 67 markers.

SBP came out 6.4% for the 9 samples in M253P in July 2011 in the Polish Project at 67 markers. There are now (Mar 2012) 11 samples with SBP = 5.0%. Marek informs me that he had identified 4 men who matched at 12 markers and actively recruited them to obtain all 67 markers and to join the Polish Project. That means only 7 of these 11 samples should be used for statistical purposes. SBP calculated on the basis of 7 samples is 8.7%, which is excellent evidence of a clade that is isolated in haplospace.

I used all 11 samples in my analysis file in order to best estimate the definition, which is also available at Haplotypes.xls.

A good signature for M253P is (391, 392, 447) = (11, 12, 24), although this signature alone is not foolproof for distinguishing I-P from all other I haplogroup samples.

Nordtvedt's I1 Tree has this I-P type as I1*-P1, with related clusters I1*-P2 and AS4.

Here is some interesting speculation for which I do not have convincing statistical evidence: Marek points out that a sample at step 4 on Ysearch is Danish, which adds to his evidence that there might be a related clade in Denmark, perhaps with a node in the I1 tree slightly older than the node for the I-P Polish clade. I do not know where that Danish sample falls in Nordtvedt’s tree.

Ysearch provides evidence of concentration in Poland. My definition is WC8JD. 73% of the samples that come up in Ysearch (8 of 11) have Poland as origin. Although this is a small statistical sample, this is the most Polish concentrated type I have seen so far. SBP=22.1% on Ysearch, due to that single Danish sample at step 4, so although statistically less confident at Ysearch, my definition can suggest samples from Ysearch for the hypothetical I-P clade, albeit with lower confidence than samples with Polish origin.

The age comes out only 567 years using all 67 markers. See cell N12 of sheet ASD in my file. There are many caveats associated with age calculation based on ASD, and this is a small statistical sample. Insofar as Marek may have recruited with a bias toward close matches, the ASD age is biased low. That said, it is clear that I-P type represents a young clade.

N-G. (N-L551). (N-G Type). Update 22 Mar 2012. Introduced on 17 Oct 2010 as “N1c1(M178)-G type”. The latest ISOGG code is N1c1d1a (L551).

Mayka suggested this one, based on a suggestion by Andrzej Bajor, from his Rurikid Dynasty Project. This type is concentrated in Lithuania, and Andrzej suggests that at least one member might be a male line descendant of Gediminas, the medieval Lithuanian Duke. Hence the “G” code.

This type has 9 samples at 67 markers very well isolated in the Polish Project with SBP = 8.9%. See N-GType.xls. The definition is also available at Haplotypes.xls and at Ysearch as RGE95, using 51 markers, cutoff 3 (samples < step 3). All but one of the N-G samples can be extracted from the Polish Project using only the signature (392, 607, 557) = (15, 14, 13).

That new L551 SNP verifies our prior prediction that G type corresponds to a clade. All 9 of the predicted G type samples at 67 markers have tested L551+, and samples predicted just beyond G type are coming out L551-. Of course, there will probably be a few exceptions as more data accumulates, but so far N-G type (STR match) is equivalent to L551 in the Polish Project.

At Ysearch, N-G type is not as well isolated; the SBP is 22% with cutoff 4, due to interference by what might be a Russian clade. There are many Lithuanian samples matching my N-G definition (RGE95), including Lithuanian samples beyond the cutoff (step 3). 46% of the Ysearch samples below step 9 indicate Lithuanian origin. L551 is too new to be included in Ysearch, so this paragraph refers to N-G type as defined by STRs.

I do not know if the Polish Project N-G samples are an independent Polish sub-clade of a larger Lithuanian clade; or if the Polish Project samples are just a random sample of individuals from a larger clade(s). I have not taken the time to search other projects for STR matches to my N-G definition, or to search for more L551+ samples. Someone might inform me before I get a chance to search. Watch this topic for updates.

The age of N-G type seems to be less than 1,000 years, perhaps only 500 years. Check the “ASD” sheet in my analysis file. ASD age is highly uncertain, particularly for such a small sample, but G type has little STR variance, so surely G represents a clade younger than 2,000 years old. Isolation is evidence of an old node, with TMRCA much younger than the node. The age of the L551 mutation can be anywhere in the time span older than the TMRCA of G type and younger than the node. N-G type is well isolated in Lithuania and Poland, but N-G may have a relatively young node with those other clades world-wide with similar STR values. Those other clades can be used to better constrain the age of the L551 mutation.

N-M. (N-L591). (N-M Cluster). Update 22 Mar 2012. Mayka suggested this one also, introducing it at the Polish Project in Jan 2011, as “N1c1(M178)-M Cluster”. The latest ISOGG code is N1c1d1b (L591). Includes Mickevius (Mickewicz) descendants. Hence the “M” code. Also concentrated in Lithuania. These two, N-G and N-M, are a small fraction of the M178 clade.

I call this a cluster because it does not meet my criterion SBP<20% to be called a type. Actually, the original proposed cluster is equivalent to what I am now calling Ma cluster, discussed below. The recent new SNP named L591 is coming out with about twice as many samples, so we have adopted the “M” short code name for the STR data for L591; this larger N-M cluster is therefore considered equivalent to N-L591.

My analysis is available, N-MCluster.xls, 10 samples at 67 markers. My best automatic definition for N-M, column CL, SBP=25%, is 80% accurate, missing one sample that is obviously L591 and predicting one sample that came out L591-, out of 10 predicted. Actually, this result is a nice confirmation of my SBP method, because although the data has only 10% background (false positives captured by the definition), my SBP formula has an increase to account for statistical confidence; hence 25% is a better upper confidence estimate of the background for so little data. I bet as more data accumulates my best N-M definition will drift below SBP=20%, qualifying as a type. Anyway, this is moot, because L591 is a better criterion for the clade, and there is a logical distinction between the N-M cluster (samples with STR correlation) and the L591 haplogroup. My definition serves as a guide for priority for L591 testing. Testing should be concentrated near the cutoff.

Accordingly, I came up with an improved STR definition for L591, using a mask to manually adjust marker selection. I’ll still call it by the short code N-M. Column CC in that file. SBP=50%, but SBP does not matter here, because the purpose of the definition is not to discover a hypothetical clade, but to predict samples for a known clade. Most clades do not produce low SBP because most clades are not well isolated. Let me elaborate with discussion of the statistical issues for N-M:

Obvious issue: There are three N-M samples with a very rare 6 step mutation at DYS446, from 17 to 11. Without DYS446, two of these three marginally fit the N-M cluster (based on STRs). These three seem to represent a subclade of L591 with modal STRs slightly drifted since their node. I marked them as “Mb” in that Excel file. Only one of these has actually tested L591+. Another one of these is that “obviously L591” sample that I mention above, the “obviously” based on this 6 step mutation, which is almost as good a marker as an SNP. That “obviously” sample is an STR outlier at other markers, which need to be excluded from the L591 definition, assuming more samples like this will show up. This seems obvious, but it needs verification with more data over the near future.

Speculative issue: There are two other outliers, which I labeled Mc and Md. Tested L591+. These may represent two clades with nodes only slightly younger than the TMRCA for L591, with independent modal drift. Highly uncertain. They might just be statistical outliers, due to the luck of random mutation. Again, more data will tell. For now, I adjusted the N-M definition to capture them, on the assumption that some future samples might come up with similar STR values.

Another issue: That one sample, mentioned above, fitting the M cluster very well but L591-, probably represents a clade with a node slightly older than L591, but similar STRs by coincidence; there may be other such clades. Again, this is speculative, but I adjusted my definition to exclude this one.

Statistical speculation summary: L591 does not seem very well isolated in haplospace, albeit more isolated than most young Y-DNA clades. It seems the L591 tree has nodes close to the SNP age, both younger and older.

My L591 definition is available in that Excel file, in Haplotypes.xls, and at Ysearch as 64RUG.

This L591 clade seems to be concentrated in Lithuania. The evidence is Ysearch - Lithuanian concentration of the N-M cluster. L591 test data is not available yet at Ysearch. My Ysearch analysis (data in a sheet in that Excel file) is similar to the G type analysis: SBP not as good because of apparent interference from clades world-wide. Using the N-M definition at Ysearch, there is Lithuanian concentration at steps well beyond the cluster cutoff, so there seems to be a larger Lithuanian clade.

In the Polish Project, I spotted evidence of such a larger STR type, about double the size of N-M, including all the N-M samples as a sub-clade. I colored these samples green in column BX of N-MCluster.xls, using all 67 markers. I dubbed this one N-L type. That 67 marker evidence is not satisfactory because it captures a couple of N-G samples. In another file, not posted on-line, I came up with a satisfactory definition for N-L; I provide it in the “Haplotypes & Masks” sheet, row 21, of N-MCluster.xls. Mayka advises me that there are two new SNPs, L1025 and L1027, that are currently candidates for a haplogroup larger than L591. We are waiting to see how those come out before introducing N-L. That N-L definition cutoff provides a suggestion of where to prioritize SNP testing.

The age of N-M (L591) comes out similar to the age of N-G type, probably less than 1,000 years; see that short paragraph in the N-G topic above. My comments about isolation of N-G in the Polish Project do not apply to L591. For N-M, it is important to exclude DYS446, because that one marker triples the age as calculated using ASD (STR variance), due to that 6-step deletion mutation mentioned above. You can see this by editing cell BV21 in my mask in my “ASD” sheet in that file. Another way to edit this is to edit the 446 value, to make the mutation count one or two, which is more representative of the age. This is a good example of one of the caveats associated with age calculation based on STR variance.
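The effect of one multi-step mutation on an ASD calculation can be seen with a toy example. The sketch below is only an illustration, not the calculation in the “ASD” sheet: the four haplotypes and the values at 391 and 439 are invented, the function name asd is hypothetical, and the squared difference from the modal value is assumed as each marker's contribution. Conversion of ASD to years is left out because it depends on mutation rates and generation time.

```python
def asd(samples, modal):
    """Average squared distance from the modal haplotype,
    averaged over all samples and all markers in `modal`."""
    total = 0
    for sample in samples:
        for marker, modal_value in modal.items():
            total += (sample[marker] - modal_value) ** 2
    return total / (len(samples) * len(modal))

# Invented toy data: 4 samples, 3 markers.
# The last sample carries 446 = 11, a 6-step deletion from 17.
modal = {"DYS446": 17, "DYS391": 10, "DYS439": 10}
samples = [
    {"DYS446": 17, "DYS391": 10, "DYS439": 10},
    {"DYS446": 17, "DYS391": 11, "DYS439": 10},
    {"DYS446": 17, "DYS391": 10, "DYS439": 11},
    {"DYS446": 11, "DYS391": 10, "DYS439": 10},
]

# 1) Raw data: the deletion contributes 6 squared = 36 to the sum.
asd_raw = asd(samples, modal)

# 2) Edit the 446 value so the deletion counts as a single mutation.
edited = [dict(s, DYS446=16) if s["DYS446"] == 11 else s for s in samples]
asd_edited = asd(edited, modal)

# 3) Mask out 446 entirely.
masked_modal = {m: v for m, v in modal.items() if m != "DYS446"}
asd_masked = asd(samples, masked_modal)

print(asd_raw, asd_edited, asd_masked)
```

In this toy data the single deletion dominates the raw ASD. The size of the inflation here says nothing about the real N-M data, where the factor is about three; it only shows why the marker has to be masked or edited.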

N-Ma. New topic 20 Mar 2012. This is the original “N1c1(M178)-M Cluster”, explained in the previous topic. Only 3 samples when introduced Jan 2011, SBP=36%. Now there are 5 Ma samples, SBP=30%. Although still not qualified as a type, there is better than a 30% chance this will improve over the next couple years as data accumulates. Lithuanian concentration, same as N-G and N-M. Again, I do not expect validity world-wide for N-Ma because of interference from other clades world-wide, but this might grow into a nice small, young Lithuanian clade. Analysis is in N-MCluster.xls, where the 61 marker definition for Ma is in column CG.

Click on the Create A New User tab, where you can upload your Y-DNA STR data from a number of testing services. Or, you can type in your data. You end up with a “User ID”.

Ysearch has a Research Tools tab to click, where you can type in other User ID’s for comparison.

Cluster Genetic Distance Method; for Haplogroup R1a: P - Pc - Pg - N - K - A - I - B - D - E - Fa - Fb - H - M - G:

USEID, 8U92G, RQK32, 92HEK, 3SEJK, MN8R3, FCUFG, EKVHX, RU8Z8, K49NZ, GNYBG, YQ6D2, EFQM7, 559EE, 24MB4, ZD29Z

Results: If there is a small genetic distance result (3 or less) for one of these types, you have a high probability of belonging to that type. There are more detailed rules available, see the “Polish Project Rules” sheet in the R1a Assigner.xls file. For haplogroups I, N, R1a, and R1b, see also Haplotypes.xls.

Reminder: This web page concentrates on the region of Eastern Europe associated with Historical Poland. If your male line is not from this region, the results of this Ysearch comparison may be misleading if there are unrelated clades, rare in Historical Poland, with haplotype range that overlaps one of these. Search for my discussion, in this web page, for your best match type; in some cases I have evidence for interference world-wide (significant matches by unrelated clades). Many men of Polish male line ancestry do not match any of these types; this web page is a work in progress. For non-Polish there is a higher probability of not matching any of these types.

Follow the R1a instructions above, except copy the following line into the “UserIDs” bar at the Research Tools page:

This topic was completely rewritten during Dec 2010 & Jan 2011; last update edit 15 Jan 2011.

Lawrence Mayka is the administrator of the Polish Project. SNP results are not posted on the web. Most of my SNP data comes from Mayka. Some of my data comes from Cyndi Rutledge, the administrator of the R1a Project. Many men join both projects, but of course many men purchase the L260 or M458 test and do not join either. If you are an administrator of an FTDNA project (or a project at another database) you may send me the L260 and M458 results for your project for merging into my analysis, if you wish. Karen Melis, the administrator of the Zamagurie Project, also sent me a few M458 results.

Data with the 67 standard markers is most common in the SNP results because Mayka and I selected these for the initial tests. In addition, men who have purchased fewer than the standard 67 markers are less likely to purchase SNP tests. This discussion is limited to the 67 marker data with only brief comments about those with <67.

Mayka and I purchased many L260 and M458 tests for Polish Project members, so test results available to me are biased toward Polish data. Also, I suppose men who notice my publication and web pages about Polish types are more likely to purchase the L260 and M458 tests, so even data not available to me might be biased toward Polish data.

At first we were concentrating on samples that match P type and N type very well, so much of the data available to me is biased toward P type and N type, of course. Later we concentrated on borderline samples that just barely match P type and N type, in order to better define the borders in STR haplospace.

If there are clades from outside R1a1a1g (M458) that just happen to have STR values that match P type or N type we will discover them quickly, but not if they are concentrated far from Poland, and particularly not if they are concentrated in any Eurasian lands where men do not tend to get DNA tests. If there are M458 clades with STR values very different from P type or N type it will take some time to discover them all, because discovering them requires “deep clade” tests, in which men without an M458 prediction get the M458 test anyway. I have many such “wildcat” results; so far I have no L260+ or M458+ with STR values very distant from P and N type. I have comments below in this topic about the few outlier results a few steps beyond P and N types.

The SNP results do not provide estimates of population frequency because we are selecting the most interesting samples for SNP tests. However, since the SNP tests verify my type classification, my STR types provide credible frequency estimates. My Results Table is still the best estimate of frequencies in Poland: P type for M458+ L260+; N type for M458+ L260-.

My types are defined by STR values following my mountain method. For samples with all 67 standard STR markers my P type definition uses 46 of those markers; N type uses 45. The cutoff for both P and N is step 7, which means samples less than 7 genetic distance (step mutations) from the definition are predicted as belonging to the corresponding type.
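The assignment rule in the paragraph above can be sketched in a few lines. This is a minimal illustration under stated assumptions: step is taken as the sum of absolute differences over the markers in the definition (the exact Ysearch counting rules, for example for multi-copy markers, are not reproduced), and the 4-marker definition and the two samples are hypothetical stand-ins for a real definition of about 45 markers.

```python
def step(sample, definition):
    """Genetic distance from the definition, counted only over
    the markers that the definition uses (assumed: sum of
    absolute differences in repeat count)."""
    return sum(abs(sample[m] - v) for m, v in definition.items())

def predicted(sample, definition, cutoff):
    """A sample is predicted for the type when its step is
    strictly less than the cutoff."""
    return step(sample, definition) < cutoff

# Hypothetical short definition, for illustration only.
definition = {"DYS385a": 10, "DYS439": 10, "DYS447": 23, "DYS572": 12}
CUTOFF = 7

# Markers outside the definition (here DYS393) are ignored.
close = {"DYS385a": 10, "DYS439": 11, "DYS447": 23, "DYS572": 12, "DYS393": 13}
far = {"DYS385a": 11, "DYS439": 12, "DYS447": 26, "DYS572": 11, "DYS393": 13}

print(step(close, definition), predicted(close, definition, CUTOFF))
print(step(far, definition), predicted(far, definition, CUTOFF))
```

The second sample lands exactly on step 7, so it is at the cutoff and is not predicted as a member; only samples below the cutoff are.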

To be fair, I should point out that I was a bit more conservative with my P and N predicted assignment rules 2 years ago, before the M458 and L260 SNPs were available, and when there were not as many samples with all 67 markers. Also, there were fewer known types 2 years ago. Half of today’s P and N outliers would be missed using my rules from 2 years ago and the others would be placed into “PK Borderline” and “NK Borderline” categories because 2 years ago I was more concerned about distinguishing P and N from K type, now known to be M458-. I no longer use those PK and NK categories. With recent data, my current STR based assignment rules are much more accurate for P and N outliers. I changed the P type definition October 2011.

I cannot define P type as exactly equal to L260, nor can I define N type as exactly M458 minus L260, because the types are defined by STR correlations. The outliers may be statistical, due to the luck of random mutations, particularly for P type with only 2 outliers so far (15 Jan 2011). I find that unlikely for N type, because the N branch STR distribution seems to have a nonrandom tail extending to many outliers. It is possible that N branch outliers represent very small clades (perhaps only one clade) with old nodes in the Y-DNA tree. However, any particular outlier at or beyond the N cutoff cannot be assigned with confidence to a subclade of N. This is the reason I use the word “branch” instead of type for outliers, because I cannot be confident they all belong to the same young clade, as opposed to multiple young clades with old branches - with old nodes in the Y-DNA tree.

However, those N type outliers provide confident assignment rules. At the N cutoff step N=7 all 4 samples in the Polish Project have been tested M458+ confirming N branch. At the next step N=8, 3 of the 6 in the Polish Project fit well for prediction into one of the M458- types, and 1 of those has been tested M458-; the other 3 N=8 do not fit any of the other types and indeed have been tested M458+ confirming N branch. This analysis is continued below in the next topic; the result is that samples without SNP results that have STR values at the cutoff or 1-2 steps beyond P or N type can be predicted with 100% probability (not 100% statistical confidence) to belong to the corresponding branch, for those samples that do not fit another type. At 3 steps beyond the cutoff probability is still about 50% for belonging to the branch.

P type and N type are very well separated from each other. Within P type, there is only 1 sample with steps N=P+5; all others are N>P+5. N type is more diffuse in STR values than P type. For N<6 there are 3 with P=N+5. The most ambiguous N type sample has N=7 (cutoff) P=8, and that one has been evaluated M458+ L260- confirming that samples marginally N type are really N branch. There are 3 others with N=6 or 7 and N<P<N+4; 2 of them are confirmed M458+ L260- and the other is M458+ but not tested for L260 yet. The most distant sample has N=10 P=9 and it is confirmed in the N branch, M458+ L260-, again providing the insight that distant STR samples with P step about equal to N step tend to fall into the N branch. (Again, this is for Polish Project samples that do not fit another known type). Of course, we expect someday to see exceptions, just due to the luck of random mutations.

There is one sample with P=9 N=11, but that one has an recLOH mutation that scores 4 steps at the DYS464 set. This is really only one mutation, so I manually adjusted the step to P=6 on this one.
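The manual adjustment is simple arithmetic, sketched below. The function name and the events parameter are hypothetical; the only inputs taken from the text are the raw step of 9 and the 4 steps scored at the DYS464 set.

```python
def adjust_for_recloh(raw_step, recloh_steps, events=1):
    """Remove the steps scored by a recLOH and count the
    recLOH as `events` mutation events instead."""
    return raw_step - recloh_steps + events

# P=9 with 4 steps from the recLOH, counted as one mutation.
adjusted = adjust_for_recloh(raw_step=9, recloh_steps=4)
print(adjusted)
```

Counting the recLOH as a single event moves the sample from 9 to 6, which is below the P cutoff of 7.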

For P type, the closest M458- sample has P=7 (cutoff); it fits I type; this is the sample that originally sparked my interest in P type. A P=8 M458- sample is assigned to K Borderline. A P=9 N=9 sample is the closest M458- sample that does not fit any known type, so is assigned to the Remainder category.

Borderline comments: In the Polish Project we use borderline categories for samples that have 50% to 79% confidence of belonging to a haplogroup or type. For P and N type samples with 67 markers, borderline means the SNP test has not been performed. With SNP results, samples are placed in the corresponding P or N type, with the understanding that outliers may in fact belong to closely related clades, as explained above.

Remainder comments: I use remainder categories for samples that have less than 50% estimated probability of belonging to any known type. Until recently we distinguished between the Rx458 category for samples not tested for M458 (and not positive for L260) vs the R458- category for samples that have been tested negative for M458. Today, all samples distant from all known types have been coming out R458-, so the Rx458 data has been merged into the R458- category.

During 2010 I used a R458+ category for N branch outliers, to distinguish outliers, which might not be true N type members. However, the distribution of N STR values is continuous, with no objective cutoff for N type vs N branch, so the R458+ distinction was dropped for now.

This discussion concentrates on samples with 67 markers for clarity. There are 31 with only 37 markers and 2 with only 12 that have SNP results. I watch these for obvious anomalies; none yet. Analysis has lower confidence with fewer markers.

Summary of results: P type and N type are very well isolated in STR haplospace. They are well isolated from M458- samples and even more isolated from each other. Roughly 90% of the M458+ samples cluster into the two STR types within which I can make future SNP predictions based on new STR data with virtually 100% confidence. The roughly 10% remainder have STR values near the cutoffs for the types, mostly N type. Future STR predictions for these can be made with more than 50% statistical confidence (up to 100% probability based on the few data available so far) because most of these that do not fit one of the other known types do come out L260+ if closer to P type and M458+ otherwise. It is possible that some of these outliers belong to small clades (perhaps only two or three) that have older nodes in the Y-DNA tree.

Age (TMRCA) of haplogroups is uncertain due to a number of caveats. That said, N type seems to be about 2,000 years old and P type seems to be about 1,500 years old. Those estimates can be up to a factor of 2 incorrect, as discussed in my caveat topic. The ages of L260 and M458 are particularly uncertain because the calculated ages are dominated by P and N types, which are quite young. The SNPs may be much older, for all we know. The outliers in the P branch are too few to have significant effect on the calculated age of P type. It is possible that the N branch is really two (or more) types that are just as young as P; the calculated N age in such a situation would come out older. Ng type provides preliminary evidence of a hypothetical subtype of N, but Ng is too small and too close to N to affect the calculated age of N.

What does all this mean? There are a number of explanations. Here is the explanation that seems simplest to me: The R1a1a1g (M458) clade seems to be thousands of years old. It may have expanded into a large population long ago. The members of this clade diffused into a wide distribution of STR values over the millennia. Then there was a severe population bottleneck followed by a rapid population expansion, or multiple bottlenecks followed by multiple expansions. The living members of M458 descend from only a few men who each lived near the beginning of the most recent population expansion. Almost all living M458 men descend from just two of those men: the N type MRCA and the P type MRCA. A low percentage of living M458 men perhaps descend from other MRCAs who lived at roughly the same time as those two, as evidenced by the outliers in the N branch SNP data available to me today.

This topic was completely rewritten during Dec 2010 & Jan 2011; last update edit 15 Jan 2011.

1 P=9, but P=5 or 6 if corrected for recLOH, so predicted P type; counted as P<6; confirmed L260+

42 P type; so far, all samples below the cutoff 7 came out L260+, confirmed P type

1 P=8 P branch outlier confirmed L260+; this one from Czech Rep. is not in the Polish Project

2 P branch outliers; so far, all SNP data samples with P<9 are either P type or fit well to another type

So far, all SNP data samples with N<9 are either N type or fit well to another type

7 P=6; 1 step below cutoff; would be predicted P Borderline prior to SNP evaluation; all 7 are M458+

3 not yet tested for L260; probably most of these will be positive, now predicted P type

These represent all the Polish Project samples at step 6, 1 step below the cutoff, because these were selected for M458 evaluation soon after M458 was discovered. So step 6 is not as common as it seems in this SNP analysis.

1 M458+ L260+ P branch outlier; not Polish Project; R1a project from Hostacov CR

1 predicted K Borderline; result M458- confirms not P or N type; still predicted K Borderline

1 N=11; M458+ L260+ This one has recLOH at 464, contributing 4 steps, so I consider this equivalent to P=6, so I count it as predicted P type, not an outlier. This is marginal, since it could be argued that the recLOH mutation may have happened after a 1 step mutation at 464 for all we know, making 2 steps, placing this sample an outlier at the cutoff 7, so my decision to predict him P type is arguable.

1 of the 6 is P=8 just beyond cutoff, but P is a tighter cluster, so this would not be predicted P, and this one came out L260- as expected

4 N=7; cutoff. These represent all the Polish Project samples at step 7, because these were selected for M458 evaluation soon after M458 was discovered. So step 7 is not as common as it seems in this SNP analysis.

even at N=9, 2/3=67% probability N branch for samples that do not fit other known types

DYS385a. The single STR marker called 385a is by far the best signature for predicting P type vs N type. All 34 samples with L260+ result so far have the value 10. All 25 samples with L260- result so far have the value 11.

Usually, a signature with more STR markers predicts better. In this case, discriminating P (L260+) from N (L260-), 385a=10 predicts best by itself. No signature with 2 or more markers discriminates better. In fact, just 385a=10 works as well as the 46 marker P type definition.

This seems amazing, but is not entirely unexpected. STR markers have lower mutation rates at lower values, and step down mutations are less frequent than step up. Since N type has mostly 385a=11, a step down to 10 should occur less often than a step up to 12.

The mutation rate of 385a=10 in P type (L260+) seems very low. At another of my web pages I postulate a rare SNP in the middle of a long STR chain to explain a low mutation rate, but such a postulate does not seem necessary in this 385a case because of the short STR chain value. For the lower rate at lower STR values, I provide a reference to Whittaker (2003) in my publication.

We can predict that future M458+ samples will be L260+ if 385a=10 and L260- otherwise. The probability is 100%. Exceptions are zero out of 59 L260 results so far. I figure the confidence of this prediction at 94%: Poisson 94% confidence interval for zero is the interval zero to 3.5; (1-3.5/59) = 94%. In other words, I am 94% confident that 3 or fewer samples out of the next 59 L260 measurements in the Polish Project will be exceptions to this new rule - that 385a=10 means L260+. Exceptions will be found eventually, of course, due to rare independent mutations from 11 to 10.
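The arithmetic above can be reproduced with a short calculation. This is one reading of it, stated as an assumption: the upper limit 3.5 is taken to be the upper end of a two-sided 94% Poisson interval when zero events are observed, which is the mean at which the probability of seeing zero events falls to 3%. The function name is hypothetical.

```python
import math

def poisson_upper_limit_for_zero(two_sided_confidence):
    """Upper end of the two-sided Poisson interval when zero
    events are observed: solve exp(-mu) = tail for mu."""
    tail = (1.0 - two_sided_confidence) / 2.0
    return -math.log(tail)

upper = poisson_upper_limit_for_zero(0.94)    # about 3.5 exceptions
n_results = 59                                # L260 results so far
rule_confidence = 1.0 - upper / n_results     # about 0.94

print(round(upper, 2), round(rule_confidence, 3))
```

With more results and still zero exceptions, the same upper limit is divided by a larger count, so the confidence in the rule rises.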

In the Polish Project, all 96 samples assigned to P type and all 15 samples assigned to P Borderline have the value 10 for 385a. There are 89 samples assigned to N and N Borderline. Only 7 of these have the value 12 for 385a; the other 77 have the value 11. In this case, predicting P type based on 385a=10, zero exceptions out of 100 samples, provides 97.8% confidence.

I postulate that 385a has only a slightly higher mutation rate in the N branch, at value 11. I postulate that those 7 N branch samples with 385a=12 belong to 2 or 3 subtypes in the N branch, 2 or 3 independent instances of a mutation from 385a=11 to 12. Most of these belong to a hypothetical Ncm type. The data is not sufficient yet to provide statistical evidence along these lines.

385a does not work quite that well for discriminating P type from all of R1a. Among the 91 M458- samples not tested for L260 there are 2 with 385a=9 and 4 with 385a=10. None of those are expected to be L260+ because L260 is a subhaplogroup of M458. The 385a marker is still the best single marker for extracting P type from a full R1a database, including M458- samples from outside the M458 (P+N) haplogroup. However, in this case, using 2 or 3 markers works better, and of course the definitions (46 markers for P, 45 markers for N) work much better than any short signature.

A few samples with 385=(10,10) represent a hypothetical subtype within P. I call this Pk. I’ll discuss it more if and when there are enough samples for statistical significance.

Other signatures. Table 3 of my publication provides other signature markers. DYS572=12 continues to be 2nd best for P type. DYS537 continues to be best for N type.

My R1a page has a handy 3 marker signature table. I announced this more than a year ago, as a handy prediction signature for the dominant types in R1a, using only the first 25 markers most common on the internet. It still works well. That signature uses (385a, 439, 447). The values for P type (L260+) are (10,10,23). The values for N type (M458+ L260-) are (11,11,23). The values for K type (M458-) are (11,10,24).
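As a quick illustration, the three signatures quoted above can be used as a lookup. The sketch below is a minimal example, not a replacement for the full definitions: it only reports an exact match on (385a, 439, 447), the function name is hypothetical, and the dictionary is keyed by the SNP status given in the text.

```python
# Signature values for (385a, 439, 447), as quoted in the text.
SIGNATURES = {
    "M458+ L260+": (10, 10, 23),
    "M458+ L260-": (11, 11, 23),
    "M458-": (11, 10, 24),
}

def predict_from_signature(dys385a, dys439, dys447):
    """Return the SNP status whose signature matches exactly,
    or None when no signature matches."""
    observed = (dys385a, dys439, dys447)
    for status, signature in SIGNATURES.items():
        if signature == observed:
            return status
    return None

print(predict_from_signature(10, 10, 23))
print(predict_from_signature(11, 11, 23))
print(predict_from_signature(12, 11, 23))  # no exact match
```

A sample one step off a signature returns None here; in practice such samples need the full definition and its cutoff rather than a 3 marker signature.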

Lawrence Mayka (independently, March 2007) constructed a “median joining network” for the 37 marker samples of the Polish Project. This network supports the definitions of the P & N clusters, and of the A subcluster. The P cluster is the left side of Mayka’s network; N is the top branch, and A is a small branch on the lower right.

29 March 2010 correspondence: I mentioned Russian sites for R1a clusters in my publication. It’s not easy for me to figure out which of those clusters correspond to my types. Mayka worked out a correspondence on 29 March, warning me that the correspondence is not exact. Some of the Russian clusters are broader than my types; some are narrower. Here are Mayka’s findings:

19 Sep 2010 update: A nice tree display of the Russian subdivision of R1a is at www.r1a.org. Robert Sliwinski brought this site to my attention.

My opinion: R1a cannot be highly subdivided with confidence based on STR data. This web site of mine is dedicated to estimating the confidence of each type that I study. I try to indicate which types are speculative. Even for the types with high confidence, the location of the nodes in the R1a tree will be uncertain until corresponding SNPs are discovered. These Russian clusters, apparently by Klyosov, have plus / minus values for accuracy of TMRCA ages that are far too small, because there are serious caveats associated with systematic statistical uncertainties.

Here is a summary of terms (in boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see the Fall 2009 issue of JoGG. By haplospace I mean multidimensional sets of STR values; each haplotype is a point in haplospace.

A cluster qualifies as a type if the graph of step frequency (number of samples at that step) vs step looks like an isolated mountain. The step is the genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate step. The cutoff is the next step just beyond the mountain. A good type has low step frequency in a “gap” of step values including the cutoff (only the cutoff for a gap of 1). In other words, the cluster forms a mountain at step values less than the cutoff, separated by a gap from the rest of the database (the parent haplogroup usually) at higher step numbers.
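The mountain picture can be sketched as a step frequency table. The example below is a toy illustration with invented step values; the function names are hypothetical, and it only counts samples below the cutoff and inside the gap, without attempting the SBP calculation.

```python
from collections import Counter

def step_frequency(steps):
    """Number of samples at each step value."""
    return Counter(steps)

def samples_in_gap(freq, cutoff, gap):
    """Samples at steps cutoff .. cutoff + gap - 1.
    A gap of 1 covers only the cutoff step itself."""
    return sum(freq.get(s, 0) for s in range(cutoff, cutoff + gap))

# Invented steps for a toy database: a mountain at steps 0-3,
# nothing at steps 4-8, then the rest of the parent haplogroup.
steps = [0, 1, 1, 2, 2, 2, 3, 3, 9, 10, 10, 11, 12, 12, 13]
freq = step_frequency(steps)

CUTOFF, GAP = 4, 5
members = sum(n for s, n in freq.items() if s < CUTOFF)
in_gap = samples_in_gap(freq, CUTOFF, GAP)

# Crude text graph of step frequency vs step.
for s in range(max(freq) + 1):
    print(f"{s:2d} {'#' * freq.get(s, 0)}")
print("members:", members, "in gap:", in_gap)
```

In this toy data the cluster has 8 members below cutoff 4 and an empty gap of 5 steps, which is the shape of an isolated mountain.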

The Statistical Background Percent (SBP) is an objective measure of the quality of the type. Low SBP is taken as evidence that a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP (yet to be discovered). Larger types with lower gaps have lower SBP. SBP is intended as an estimate of the background percent of samples in a type that really do not belong to the corresponding hypothetical clade.

SBP is increased to account for the estimated probability of outliers from other clades. An outlier is a sample that has very unusual STR values due to the luck of mutations. SBP is also increased to account for the estimated probability of small foreign clades that just happen to have the same STR values but are not closely related to the type. The SBP is also increased to provide the rough equivalent of the maximum in a confidence interval. Small sample counts have wide confidence intervals. So larger types (more samples) automatically get lower SBP. For a valid clade, SBP should decrease with time as data accumulates in a database. A very well isolated clade will have a low SBP even with only a few samples.

SBP < 5% is very rare - a very well isolated type, very likely to be a clade. SBP < 25% is good enough to be announced on the web. SBP > 25% is a cluster worth watching as data accumulates with time, although I avoid using the word type for SBP > 25%. SBP > 50% is not statistically meaningful although such clusters might improve as data accumulates. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look like mountains.

The number of markers in the definition should be chosen to provide as small an SBP as possible; my Excel tools provide automatic rank of markers as an aid; human judgment can be used to include or exclude markers with obvious problems.

A signature is a small set of markers that rank best, convenient for publication of a type, and for simple demonstration of the correlation of STR values.

I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes that differ from the modal haplotype by step less than the cutoff. The definition of a type is the modal haplotype plus cutoff. The definition uses only those STR markers that provide the lowest SBP, but the definition uses as many STR markers as possible if there is a tie. The definition of a valid type may change slightly as data accumulates.

Here are some common terms (in boldface) for genetic genealogy. I did not define these, although I use them in a restricted sense: A marker (also “locus”, plural loci) is a DNA location for an SNP or STR or other kind of mutation. A haplotype is a set of gene values at any number of markers, here restricted to Y-DNA STR values. I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man. A sample is also commonly called a haplotype, but I avoid calling a sample a haplotype to make it clear that a haplotype may or may not be present in a particular database of samples. A clade is a general term for common descent, so an SNP haplogroup is one kind of clade. I use the word clade in general, meaning a Y-DNA clade that may or may not be a defined official haplogroup. All types have associated hypothetical clades, but most clades cannot be isolated as types with low SBP. A cluster is a set of samples with similar STR values. All types have associated clusters but not all clusters are associated with types. The modal value for a marker is the most common value in the cluster. The modal haplotype is the set of most common values, usually the most common haplotype in a cluster. Many people use the adjective “modal” as a noun, meaning “modal haplotype”; so do I; I tried to avoid that in this web document.
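The definition of the modal haplotype above can be shown with a short Python sketch. The function name and the three 3 marker samples are hypothetical, chosen so that the modal haplotype is not itself one of the samples; this illustrates why a haplotype may or may not be present in a particular database.

```python
from collections import Counter

def modal_haplotype(samples):
    """Most common value at each marker position across the samples.

    If two values tie at a marker, the one seen first is kept.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*samples))

# Hypothetical cluster of three samples at three markers.
cluster = [
    (13, 25, 16),
    (13, 24, 17),
    (14, 25, 17),
]

modal = modal_haplotype(cluster)
print(modal, "present in cluster:", modal in cluster)
```

Each marker has a clear modal value (13, 25, 17), yet no sample carries all three values together.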

The rest of this topic provides discussions and more definitions that are not part of my Mountain Method. These are discussions and terms that I use often, so I provide them here for easy link reference from my web pages. Some of these terms are not common in genetic genealogy. Some of these I do not recall seeing used in documents at all, so they might be my inventions, although I suppose other writers may have used these terms with similar meaning:

A bimodal marker has a second STR value with many samples - more than expected statistically - in addition to the most common modal value. A multimodal marker is possible if there are more than two common values for the marker and if those common values are not distributed more or less symmetrically on both sides of the most common value. (A Bessel distribution is statistically expected for a low fraction of random independent mutations at an STR marker. A Bessel distribution is close to a Gaussian distribution for a high fraction of independent mutations. A Bessel for a low fraction looks like a tent; a Gaussian looks like a bell.) Step up mutations are more common than step down for short STRs, so for example a modal 8 plus a few more 9 values than 7’s does not necessarily mean the 9’s are statistically significant; experience helps to judge. RecLOH and other issues at compound markers also cause confusion in this regard. A bimodal marker is a hint that there may be a clade associated with that 2nd value, so genetic genealogists study clusters defined by one or a few such bimodal 2nd values. The main modal value also sometimes makes a good signature at a bimodal marker. In other words, a set of values using one or more bimodal or multimodal markers makes a good signature for a hypothetical cluster.

In the past, I have sometimes called such clusters hypothetical types. I now prefer to reserve the word type for < 20% SBP, which Mayka and I take as evidence for 80% confidence that more than 80% of the samples belong to a clade that will someday be confirmed as a haplogroup by a newly discovered SNP. Sometimes we make exceptions above 20%, for example when a cluster is regionally concentrated, or associated with an ethnic group.

I had sometimes used “bimodal marker” for that second STR value, but I try to avoid that confusion. It’s the STR marker that is bimodal, with two common values.

There is no known way to calculate the % confidence that a cluster corresponds to a clade, but an experienced genetic genealogist can roughly estimate confidence based on experience. I developed SBP so that 100% minus SBP expresses my confidence, but only for clusters with less than 30% SBP; SBP breaks down around 50%. I avoid publishing clusters in which I estimate less than 50% confidence, although I may mention some as speculative.

Not all Y-DNA STR data separates into types because the distribution of STR values tends to be continuous. A type corresponds to a clade that experienced a population bottleneck - isolation or migration or very rapid population growth.

A main branch of the Y-DNA tree is old, with data on the web for thousands of samples belonging, and with many known further branching divisions. I like to use the word twig for a small young branch of the Y-DNA tree. A terminal branch is a smallest known division of the tree; a terminal branch might be a haplogroup, or a type, or a hypothetical cluster; a terminal branch at one web site might not exist at another web site; a terminal branch might be very small (one or only a few samples) or very large (many samples).

By the age of a clade (haplogroup or type or hypothetical cluster) I mean the TMRCA. By definition, a TMRCA corresponds to a node in the Y-DNA tree, where two clades branch. (Sometimes more than two clades meet at one node, but we expect future SNPs might resolve that node into multiple nodes with two clades each.) An SNP is probably older than the TMRCA of the haplogroup it defines, and the node for two SNPs is probably older than either SNP, because there are usually many generations between old nodes, due to the statistical pruning of the Y-DNA tree (Y-DNA clades tend to die out statistically). The probability is very low that an old SNP mutation happened in exactly the same generation as the TMRCA. (An exception would be a recent private SNP found in an extended male line family.) I call the segments between nodes smooth branches, where there are no known nodes in that segment of the Y-DNA tree. A long smooth branch in the Y-DNA tree is one way to visualize isolation in haplospace. Any type, because it is isolated, probably has a long smooth branch older than the type. A smooth branch is necessarily a statistical estimate, because it is not possible to be sure a branch is smooth; the evidence is multiple equivalent SNPs, or less than usual STR variation. In addition, there may be small branches with living men who have not registered Y-DNA data on the web. So a “smooth” segment really includes the possibility of very few small branches. The metaphor of a tree is appropriate, because a large branch with very few twigs looks smooth from a distance; a smooth branch in an old tree was not smooth many years ago, but the twigs in that segment have died and fallen off the tree over the years. A Y-DNA branch can be smooth in one database (like the Polish Project) and not smooth in a larger database (like Ysearch, if significant branches in that segment are rare or absent in Poland). All this paragraph applies to hypothetical clusters, but with lower confidence.

This topic provides examples of the complications that arise concerning probability estimates for Y-DNA STR based predictions.

The modal haplotype is 13, 25, 17, 10, 10, 14, 12, 12, 10, 13, 11, 30. See Haplotypes.xls for more detail.

In the Polish Project (data on 7 Aug 2013), 71 samples (men) have this 12 marker haplotype.

Of these 71, 18 have been tested for the SNP L260, and one came out negative; the rest are all positive. 17/18 = 94.4% positive. On this basis, for samples with this haplotype which have not been SNP tested yet, we might predict that 94.4% of them will belong to the L260 haplogroup, coming out positive for the L260 test.

By the way, that one negative sample is from my maternal cousin, representing my mother’s father’s male line. In 2007 my cousin’s home page at FTDNA reported 80 perfect matches at 12 markers, so it seemed like he belonged to the largest Y-DNA clade in Poland. I now know, with his L260- result, that he belongs to a small, rare clade and he matches L260 at 12 markers just by luck.

This example shows that a perfect match at 12 markers does not guarantee that the samples are all in the same haplogroup branch. Actually, my cousin had no matches at the FTDNA site at 25 markers in 2007, and at 37 and 67 markers his STRs are very different than P type STRs, so I suspected even before the L260 SNP was discovered that he belonged to a different haplogroup branch.

My cousin now (Aug 2013) has 313 perfect matches at 12 markers, representing P type at 12 markers from the large FTDNA database, but these do not have SNP data published except when they join a project, like the Polish Project. Most men do not join projects, so most Y-DNA data details are not available to us.

One sample out of 18 is not very good statistics. We have low confidence in such a small sample. If we test 18 more of those 71 samples we might get zero samples L260-, or 2, or 3 samples - just by the luck of sampling statistics. There is a standard way to calculate the confidence - Poisson statistics.

(For a small sample, Binomial statistics should be used. For a small result out of a large sample, Binomial is the same as Poisson; for one result out of 18 both Binomial and Poisson give about the same statistics. I have a Poisson calculator handy in the SNP sheet of my Type.xls files, so I’ll use Poisson for this example. By the way, for a large result out of large sample, both Binomial and Poisson are the same as Gaussian statistics, also called the Normal distribution, or the bell curve.)

In other words, if we test for L260, with 100 more samples matching the 12 marker P type modal haplotype, this calculation predicts 95% confidence that 69 or more of them will be L260; and 90% confidence that 74 or more of them will be L260; and 80% confidence that 78 or more of them will be; etc.
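The figures quoted above can be reproduced with a short Python sketch of an exact two-sided Poisson confidence interval, solved by bisection. That Type.xls uses this particular interval is an assumption on my part here; it is offered because it gives the same numbers. The function names are hypothetical.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def _solve(func, target, lo, hi):
    """Bisection for func(mu) == target, where func decreases as mu grows."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(count, confidence):
    """Exact two-sided interval for the Poisson mean, given an observed count."""
    tail = (1 - confidence) / 2
    # Upper limit: the mean at which `count` or fewer has probability `tail`.
    hi = count + 1.0
    while poisson_cdf(count, hi) > tail:
        hi *= 2
    upper = _solve(lambda mu: poisson_cdf(count, mu), tail, 0.0, hi)
    # Lower limit: the mean at which `count` or more has probability `tail`.
    if count == 0:
        lower = 0.0
    else:
        lower = _solve(lambda mu: poisson_cdf(count - 1, mu), 1 - tail, 0.0, float(count))
    return lower, upper

negatives, tested = 1, 18  # one L260- result among 18 tested samples
for confidence in (0.95, 0.90, 0.80, 0.79):
    _, upper = poisson_interval(negatives, confidence)
    floor = 100 * (1 - upper / tested)
    print(f"{confidence:.0%} confidence: {floor:.0f} or more of 100 would be L260+")
```

Only the upper limit on the negatives matters for these statements, because it sets the lower limit on the positives.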

OK. So far this is all standard first year college statistics calculations, demonstrating that probability calculations do not have high confidence with a small number of samples. For very large samples, the confidence is very close to the probability in this kind of calculation, but when studying Y-STRs we rarely have large samples, because as more data become available more haplogroups are discovered, so we end up working with newer, smaller haplogroups.

For STRs, I like to use a single confidence number, as a simplification: by 80% confidence I mean 80% confidence for 80% or more; by 90% confidence I mean 90% confidence for 90% or more; etc. This is my own simplification; I have not seen this simplification used in statistics books. This simplification only works when the upper limit of the confidence range is close to 100%, which is usually the case for STR predictions. Notice that last line in the listing above: one out of 18 means 79% confidence that 79% or more will be L260. So using my simplification, I consider that I am 79% confident that 79% or more of future P type modal haplotypes at 12 markers might come out L260, based on this calculation example so far.

Haplogroup L260 is a subdivision of haplogroup M458. M458 is the “father” of L260. All L260+ samples must be M458+ (except for very rare instances of a back mutation). Of those 71 samples that we are discussing here, considering the ones not tested for L260, one of them is M458-, so that sample is almost certainly L260-. That makes two samples, not one, among the 12 marker P modal samples, that are L260-. It seems that instead of 17/18 = 94.4%, we should be using 17/19 = 89.5%. However, that’s not fair. There are several more samples not tested for L260 that are M458+; we just don’t know how many of those might be L260-. It is a good idea to check the father branch (and grandfather branch) when doing a study like this, but it is not easy to adjust the statistics. Anyway, there is a more important objection in this particular case, next topic:

Both of those samples, that L260- and that M458-, have the same last name, Iwanowicz. They are both maternal cousins of mine; I recruited both of them, paid for their testing, and enrolled them in the Polish Project. I got that M458- result before the L260 SNP was discovered, then later I tested the other cousin and got the L260- result. Since I recruited both of them, they should count as only one sample, not two. I call this a family set.

When doing statistical analysis, I count family sets as only one sample, unless 2 or more of them tested DNA and joined the Polish Project independently. In P type, there are 7 samples with the name Lapinski; by email response, Lukasz Lapinski informed me that he and two others tested and joined independently, and he recruited the 4 others, so I count this set as 3. Only 4 of them match the P modal at 12 markers (the others differ by one or two markers), and of the 4 only Lukasz tested for L260, so this interesting example does not affect our calculations for those 18 samples discussed above.
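The counting rule for family sets can be written as a small Python sketch. The function name and the data layout are hypothetical; encoding the two Iwanowicz samples as zero independent testers (both were recruited by the same person) is my own convention for the illustration.

```python
def effective_count(family_sets):
    """Count each family set as its number of independent testers, but at least one."""
    return sum(max(1, independent) for _, independent in family_sets.values())

# name -> (samples in the project, samples that tested and joined independently)
family_sets = {
    "Lapinski": (7, 3),   # three independent, four recruited: counts as 3
    "Iwanowicz": (2, 0),  # both recruited by one person: counts as 1
}

raw = sum(samples for samples, _ in family_sets.values())
print("raw samples:", raw, "counted as:", effective_count(family_sets))
```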

The family name is one way to spot potential family sets. I have other ways. I have ways to check if they really are family sets. I won’t cover the details here, I’m just pointing out that I edit family sets when calculating probabilities and percentages; sometimes this editing causes significant differences in the calculations. Check my publication for more discussion about family sets.

Using the set of 25 standard markers, only 4 of those 71 samples match the P modal, as we expect because the additional 13 markers are rapid mutators. Analysis at 25 or 37 markers is interesting, but I’ll jump to 67 in this discussion. Of those 71 that we have been discussing, 11 have exactly 12 markers, 3 have exactly 25, 18 have 37, 31 have 67, and 8 have 111, so 39 samples, more than half, are available at 67 markers.

Of the 1423 samples with 67 markers in the Polish Project, 707 (about half) are R1a (data download 7 Aug 2013). (For samples not SNP tested, even with only the minimum STR 12 markers, FTDNA does an automatic prediction; R1a is predicted automatically by FTDNA with better than 98% confidence.)

Of the R1a samples with 67 markers, xx have tested L260+ and xx have tested L260-.

See Polish Project Assignments for a brief overall explanation of how assignments are done. This topic provides more detailed discussion. This topic focuses on the R1a categories, but most of this discussion obviously applies to other categories.

Each sample (individual man) is assigned to a category. Many categories are known haplogroups or paragroups. Haplogroups are defined by SNPs, but not all haplogroups are supported by FTDNA assignments, which may cause some confusion.

Some categories are types, which are hypothetical haplogroups. Borderline and cluster categories are discussed near the bottom of this topic. Click on Remainder and Unassigned for discussion of those two categories elsewhere.

The assignment guideline is at least 80% probability for each individual sample. Using an 80% minimum, most assignments are better than 80%, of course. So the average probability for a category is higher than 80%, and the average varies by category depending upon how many samples are marginal near 80%.

For haplogroups, “80% probability” means that if a large number of samples with 80% probability were SNP tested, about 80% of them would test positive for the haplogroup into which they were predicted. Probability is determined by correlating STR values with samples that have been tested for that SNP.

Some assignments are 100% probability - samples with positive SNP test results, assigned to that haplogroup, and not given an extended assignment. Actually, there is no such thing as 100% because the genetic test might be in error, but it seems from experience that testing errors are much lower than 1%.

I arrive at probabilities with a combination of calculations and educated estimates. This topic is my explanation.

I figure probability as a decreasing function of step from a modal haplotype. My assignment rules are step distances at which I figure 79% probability. If a sample matches the modal haplotype at less than the 79% step distance, I assign that sample to the corresponding haplogroup or type or other category. In practice it’s complicated. I use an Excel file for assignment. You can view the file at www.gwozdz.org/PolishCladesUpdate/Assigner.xls. That may not be the current version. In that file the “PolishProjectRules” sheet has the list of rules for human reading - next to the coded logic functions for Excel. If you are a Polish Project member you can find your kit number and view your step to each category in the table - “Modal Calculator” sheet.

The following paragraphs explain how I figure probability for types. This is not something I proved in my publication, but it seems to me that my publication makes it reasonable. I hope you the reader find the following method reasonable. I expect this method will be proven with time as most of my predicted types are validated.

If a type has 90% probability of being valid and a particular sample has STR values that match the type with 90% probability, those two numbers get multiplied for net probability. That particular sample has 81% net probability of validity, and 19% probability of invalidity. I do not actually calculate this. This paragraph is a conceptual explanation introducing the explanation in the following paragraphs.

My publication has detailed discussion of my statistical method for types. Briefly, I use SBP as a quality measure. SBP is a measure of the background - the percent of samples that match the type but really do not belong. For example if SBP = 15%, that means 15% is a measure of how many samples within the type (step less than cutoff) really do not belong to the type. For this example, a typical sample in the type has 85% probability of really belonging to the type.

It is not possible to calculate the probability that a type really is a clade that will be validated some day by an SNP not yet discovered. Although 100% minus SBP is not the probability of type validity, 100% minus SBP is closely related to validity. Certainly a type with high SBP has low probability of being valid. Certainly a type with SBP less than 15% has high probability of validity.

SBP is a high calculation, designed for roughly 70% confidence interval, with additional increase for many statistical reasons explained in my publication. That’s why I call it “Statistical Background Percent”. This statistical increase is small for small SBP and larger for larger SBP. The way SBP is calculated, it goes over 100% for type candidates with high background; SBP should not be used over 50%.

The best estimate for background percent is lower than SBP. However, as explained a few paragraphs above, the net percent of invalid samples (net invalidity) is higher in the cluster of a type, because of the unknown probability that the type itself is invalid as a whole. It is convenient for me to assume these two considerations cancel each other. I use SBP as my estimate for the net background percent of invalid samples in a type.

A sample that matches the modal haplotype has close to 100% probability of belonging to the corresponding type. For a type with a high cutoff, this is true even for a sample a few steps away from the modal haplotype. The reason is that the vast majority of haplotypes in a type are at the highest step numbers, so that is where most of the background is. This is explained in the discussion of Table 1 on page 145 of my publication.

So here is my method: I figure an assignment rule “step < S” to assign samples, where the samples at step S and greater, equal to about SBP percent of the type cluster, do not get assigned.
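One way to read the rule above is sketched in Python below: choose the limit S so that the samples at step S and greater come closest to SBP percent of the type cluster. Taking “about” to mean “closest to” is an assumption, and the function name and step counts are hypothetical; in practice the limit also involves judgment.

```python
def assignment_step_limit(step_counts, sbp_percent):
    """Return S for the rule 'step < S', leaving out about sbp_percent of the cluster."""
    total = sum(step_counts.values())
    # Start with a limit that keeps every sample (nothing left out).
    best_limit = max(step_counts) + 1
    best_error = abs(0.0 - sbp_percent)
    excluded = 0
    # Walk down from the highest step, growing the left-out tail one step at a time.
    for s in sorted(step_counts, reverse=True):
        excluded += step_counts[s]
        error = abs(100.0 * excluded / total - sbp_percent)
        if error < best_error:
            best_limit, best_error = s, error
    return best_limit

# Hypothetical type cluster: number of samples at each step below the cutoff.
step_counts = {0: 10, 1: 25, 2: 30, 3: 20, 4: 10, 5: 5}
limit = assignment_step_limit(step_counts, sbp_percent=15)
print(f"assign samples with step < {limit}")
```

With these made-up counts, steps 4 and 5 hold 15 of 100 samples, so an SBP of 15% gives the rule step < 4.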

This finishes my brief justification for using SBP as a guide for assignment. More discussion of details:

There are other calculations in addition to SBP, for example haplogroup correlations mentioned above.

Another is the calculation of correlations for 37 marker rules, which are similar to haplogroup correlations. Using 67 marker data for a type, the 37 marker data for those samples provide probabilities that other samples with only 37 markers belong to this same type.

After I do a particular calculation many times, I feel confident glancing at new data and making quick estimates for new rules if the number of samples does not justify detailed calculation.

Let me repeat what I said above: I arrive at probabilities (assignment rules are 80% estimated minimum probability) with a combination of calculations and educated estimates.

Mayka, who does the assignments for most categories other than R1a, does not use my calculation methods, but insofar as he uses his experience to judge STR correlations, he is really performing estimated correlation calculations.

When a probability is judged close to the 80% minimum for assignment based on STR correlations (step close to the rule limit), there are a number of additional factors that can be considered. The following paragraphs are examples. More examples are in my publication. Mayka uses similar considerations for assignments:

Geographic concentration. P type is an example. P type is concentrated in Poland. I considered P type as more likely valid because it is geographically concentrated, before it was validated by an SNP. Back then I considered a Polish family name associated with a sample as marginal additional evidence of belonging to P type. Today that consideration applies to a sample that marginally matches the P type haplogroup with STR values but has not been measured for the L260 SNP.

Ethnicity. For example, there are a number of haplotypes known to be common among Jews, so a Jewish name associated with a sample is marginal additional evidence that the sample belongs to a corresponding haplogroup or type.

Stragglers. We tend to avoid categories for only one or a few samples, so if one or two samples have 70% probability as a best estimate it makes sense to adjust the rule a little looser so that the rule picks up those few samples that do not quite fit, rather than create a borderline category (discussion below). Conversely, it makes sense to be a bit stricter for type assignments if a borderline category is available.

67 markers. We are marginally more liberal with assignments using the full 67 markers and marginally stricter for samples with fewer, because those with fewer can get more accurate assignments by procuring the remaining markers.

Men with closely matching STR should be classified together, particularly if the family name is the same.

We avoid changing assignment rules too often, so some assignment rules may remain in place for a while even after new data has provided slightly better rules.

For a valid type SBP comes down as data accumulates, with better statistics. I avoid introducing a new small type with SBP above 25%, because I expect it to improve with time. Technically, SBP = 40% means 60% of the samples can be introduced as a new type category, but I prefer to wait a few months for more data, so that a new type is substantial at introduction.

For some types, many of the samples near the cutoff have already been assigned with high probability to another type. So those assigned samples should not be included in the SBP calculation. K type is an example. Although my published SBP for K type is 26%, many samples at the cutoff are assigned with high confidence to other types, including many P type that have tested positive for the L260 SNP. The true background for K type is much less than 26%, although I have not taken the time to do an adjusted SBP calculation.

We do not wish to be dismissed by others with experience evaluating STR data. On the other hand, we do not wish to have others point out that samples are being left without obvious assignment. I suppose the goal should be that the number of people complaining that assignments are too liberal turns out to be about equal to the number of people complaining that assignments are too conservative (people with experience evaluating STR data who have read and understood my documentation here).

A person who assigns samples to hypothetical haplogroups based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by reality very quickly. Probabilities of an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based predictions. In the past, a number of STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you can read as much as you wish about our (Mayka’s and my) methods, judging for yourself the reliability of our probability estimates.

{This entire topic needs rewrite. This is an old version. I moved the probability discussion to a new topic, above. Much of this topic is OK as is for explanation of “confidence”, but most is redundant. Watch this space for a rewrite.}

This topic is about confidence. I’m not trying to be statistically exact here. I’m just trying to explain a point that may not be obvious to everyone: Confidence is not the same as probability. For example, I could calculate a 90% probability of no rain today based on data showing that on this day in this place, over a large number of years, it only rained on this day for 10% of the years. However, if I can see storm clouds in the distance, I have much less than 90% confidence of no rain.

My minimum 80% probability rule for assignments also means minimum 80% confidence. I give an example in the next paragraph of one method to calculate confidence. However, most of my confidence estimates for assignments are based on educated estimates, not exact calculations.

Confidence interval example: By 80% confidence I mean 80% is the lower number of the 80% confidence interval. For example, 80% confidence might mean that the actual probability is 90% but the 80% confidence interval is 80% to 96%. In the following paragraphs I

As an example, consider a situation where 10 samples match a type with an STR test. Suppose there is a definitive SNP test available, and 9 of those 10 samples test positive for the SNP, and 1 negative. That means 9 of the 10 really belong to the haplogroup and that 1 mismatch must come from a different haplogroup that matched the STRs by the luck of mutations. Next, consider a new sample that matches that same STR test. What is the confidence that the new sample will pass the SNP test for the haplogroup? The probability is 90% because we know that 9 out of 10 previous samples like this matched the SNP. However, 1 out of 10 is a very small sample. As explained in my publication, I use Poisson statistics for quick calculation of confidence interval. Poisson statistics is simple to calculate in Excel. My tool Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.

80% confidence interval of 1 is 0.11 to 3.89, which is 1.1% to 38.9% out of 10, so subtracting from 100%, the 80% confidence interval of a match comparing to 9 out of 10 is 61.1% to 98.9%; that lower number 61.1% means the 80% confidence ranges to lower than 80%, so net confidence is lower than 80%.

70% confidence interval of 1 is 0.16 to 3.37, which is 1.6% to 33.7%, lower number 66.3%; net confidence lower than 70%.

60% confidence interval of 1 is 0.22 to 2.99, lower number 70.1%; confidence higher than 60%.

67.3% confidence interval of 1 is 0.18 to 3.26, lower number 67.4%. So that’s my one number: 67% confidence.

In other words, if 9 out of 10 samples that match an STR also match the SNP test, we have at least 67% confidence a particular future sample matching the STR test will also match the SNP test.

For 18 out of 20, the probability is still 90%, but a similar calculation shows 75% confidence.

For 36 out of 40, the probability is still 90%, but a similar calculation shows 80% to 96% confidence interval, minimum 80% confidence, which is my example that I started with above. These calculations take less than a minute using my Excel cells.
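The search for “my one number” in the examples above can be automated. The Python sketch below looks for the confidence level at which the lower end of the interval for the match rate equals the confidence level itself. It assumes the exact Poisson upper limit on the number of misses, which agrees with the interval values quoted above; the function names are hypothetical.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def upper_limit(count, confidence):
    """Upper end of the exact two-sided Poisson interval for an observed count."""
    tail = (1 - confidence) / 2
    lo, hi = 0.0, count + 1.0
    while poisson_cdf(count, hi) > tail:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if poisson_cdf(count, mid) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def single_number_confidence(misses, tested):
    """Confidence c such that the c-level interval for the match rate starts at c."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        c = (lo + hi) / 2
        floor = 1 - upper_limit(misses, c) / tested
        if floor > c:
            lo = c  # the interval still starts above c, so c can be raised
        else:
            hi = c
    return (lo + hi) / 2

for misses, tested in ((1, 10), (2, 20), (4, 40)):
    c = single_number_confidence(misses, tested)
    print(f"{tested - misses} out of {tested}: about {c:.0%} confidence")
```

The probability is 90% in all three cases; only the sample size changes the single confidence number.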

Statistical Background Percent: SBP. I use SBP as a net confidence estimate for the background (samples that match the STR values but really do not belong to the clade of a type). My publication does not go into the details of confidence intervals. That is the purpose of the explanation here in this topic. SBP is my estimate for the net statistical confidence before any SNP has been discovered to validate a hypothetical type. 100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the corresponding hypothetical clade.

A mountain cluster corresponding to a type might include outliers from other clades, or might include foreign clades. These and other caveats associated with STR prediction are discussed in detail in my publication, where I point out that the confidence for all such caveats cannot be calculated. I estimate the background by using the low frequency of samples in the gap as representative of the background throughout the haplospace neighborhood. My SBP formula (available in the tools) includes an increase in SBP to account for all such caveats.

Part I of my publication explains: “Much of the background is probably at the last step of the mountain, just before the cutoff. Much of the remainder is probably at the previous step, much of the remainder after that at the previous step, etc.” My Part I Table 2 justifies this by demonstrating how the number of possible haplotypes increases very rapidly with step. In other words, SBP is a good worst case overall estimate of background percent within a type, but background percent is very low at step zero and increases rapidly with step. My publication does not provide a formula for background vs step and in fact I have not derived a formula. For assignment of samples, I estimate the confidence vs step in a manner to provide a rapid decrease in confidence near the last step, in a manner to produce overall confidence roughly equal to 100% minus SBP. Step zero is my rough estimate that the type is a valid clade, since the step zero samples belong to the clade with very high probability if the type is valid.

Some outliers from the type statistically fall within or even beyond the gap, so confidence is not zero at the cutoff.

Confidence also depends upon the size of the gap. A wide gap with zero samples means even samples in the gap near the mountain have reasonable confidence percent.

Estimates vs Calculations vs Adjustments: A person who assigns samples to hypothetical clades based on STR values acts like a bookie who provides advance estimates for gambling odds, using a combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by reality very quickly. Probabilities of an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based predictions. In the past, a number of STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you can read as much as you wish about my methods, judging for yourself the reliability of my estimates and net probabilities.

The first confidence interval example above, confidence of STR predictions calibrated to SNP data, can be pure statistical calculation without any estimates. However, judgment is involved. Even such SNP predictions should be split into parts based on the step value of the samples within a type. However, if split down to individual steps, the statistics are very poor due to small sample size, so steps are best combined in batches. For the first data from a new SNP it is necessary to combine all the steps, so the predictions benefit from an estimated confidence by step. So the judgments and calculations can get quite complicated, and often I just estimate the confidence from experience rather than do the calculations every day as data comes in.

I avoid changing assignment rules often, so some assignment rules remain in place even after new data has provided better rules.

My standard is 80% confidence, but I avoid introducing a new type until the confidence is a bit higher, because a new 80% confidence type would provide only a few samples at step zero on the day when enough data has accumulated. After waiting for more data, I tend to bend the guidelines a bit below 80% confidence in order to introduce more samples with a new type. Also, if I notice an individual coming out at 75% when I’m updating rules I’ll tweak the rule to include him.

I tend to be generous in estimates for samples with all 67 markers, and I tend to be conservative with samples having fewer than 67. I update the rules more often at 67. After all, samples with fewer than 67 markers can get much better confidence by ordering more markers, and 67 is the most available as a standard commercial test.

I do not look forward to a man feeling slighted when he is not assigned to a type that is a reasonable fit to his STR data. On the other hand, I do not wish to be dismissed by others with experience evaluating STR data, so I try to be conservative in my probability estimates that particular clades in fact exist. I will have achieved my goal if the number of people complaining that I assign too liberally turns out to be somewhat greater than the number of people complaining that I am too conservative (counting only people who have read and understood my documentation).

Naturally, my confidence changes from month to month as more M458 and STR data accumulate, giving better statistics.

Assignments at fewer than 67 markers: There are two ways. First way: Some types have low SBP and seem 80% valid using 37 or only 25 markers, at least for samples at low step, so samples can be directly assigned.

Second way: Using the samples with 67 markers, I check what percent of samples at a given genetic distance using fewer markers end up in the corresponding type at 67 markers. The confidence of a sample at fewer markers is that percent multiplied by the corresponding confidence at 67 markers.
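The second way amounts to a single multiplication, sketched here in Python. The function name is hypothetical and the 80% figure in the example is only an assumed confidence at 67 markers; the 13 out of 14 count is the 12 marker P type match count described later in this document.

```python
def confidence_at_fewer_markers(n_in_type_at_67, n_at_distance, confidence_at_67):
    """Confidence for a sample tested at fewer than 67 markers.

    n_in_type_at_67: 67 marker samples that, at the given genetic distance
        on the shorter marker set, end up in the type at 67 markers.
    n_at_distance: all 67 marker samples at that genetic distance on the
        shorter marker set.
    confidence_at_67: confidence (0 to 1) of the assignment at 67 markers.
    """
    if n_at_distance <= 0:
        raise ValueError("need at least one 67 marker sample for calibration")
    fraction = n_in_type_at_67 / n_at_distance
    return fraction * confidence_at_67


# 13 of 14 samples matching at 12 markers are in the type at 67 markers;
# 0.80 is an assumed confidence at 67 markers.
print(round(confidence_at_fewer_markers(13, 14, 0.80), 3))
```

The product is always lower than the confidence at 67 markers, which is one reason samples with fewer markers are assigned more conservatively.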

I look forward to the discovery of SNPs validating more than 80%, probably more than 90%, of my R1a Polish Project type assignments.

I introduced P, N, and K types in the Fall of 2007, publishing this web page 6 Dec of that year. I did not predict that P and N were brother clades, in fact it looked to me like P was closer to K. I did not make predictions about the P, N, K structure because the statistics did not justify such predictions. I assigned samples to P and N with 80% probability, remarking that my overall confidence that P and N were valid (confidence at step zero) was 95% in 2008. I stated my overall confidence in the subtypes of K type as only 80%, but again my confidence in K type at step zero was (and still is) 95%.

N type is very nearly the same as R1a1a7*, the paragroup defined by the SNP M458 minus L260. This is not exactly a validation, because there is a low percent of M458 samples (2 so far at 67 markers) that seem to be older than N type, which implies that a future SNP, younger than M458, may be discovered as equivalent to N type. In previous versions of this document, I explained: “A new SNP marker may not fall at the node defining a type.” A new SNP might be younger, including mostly the samples with low step from the corresponding type. A new SNP might be older, including the corresponding type plus some samples with step beyond the cutoff for the type.

In Fall 2007 I also introduced R (Remainder) as the 4th division of Polish R1a, for those samples that do not fit P, N, or K. K type plus the R category are equivalent to R1a1a* (M17, M198, M458-). The R1a table assigns new types to either K or R. In the detailed discussion of the types I discuss which types have: (a) high confidence as subtypes of K; (b) high confidence as not subtypes of K so surely go into R; and (c) lower confidence of assignment to K or R so are assigned with a best guess. A new SNP for K type might include a few of these subtypes, and may include some of R, depending upon the age of such a new SNP.

This topic uses R1a as an example, but the same discussion applies to other haplogroup assignments.

My publications have several references of general interest and relevance to my web documents.

My Tools and data for STR analysis are Excel files. These are available at the JoGG publication site as Supplementary Data: www.jogg.info/52/files/cpcindex.htm.

Pawlowski (2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my publications. I specifically mention it here because this is where I originally found the common Polish haplotype that I now call P type. Link to English abstract: Pawlowski 2002.

Lawrence Mayka is the Administrator of the Polish Project. Larry helped me to get started when I was new to genetic genealogy, providing helpful criticism & suggestions. He reviewed & approved my 80% probability rule for assignments on the Polish Project web page. He also reviewed the original drafts of my publications. A number of my types were originally suggested to me as STR clusters by Larry. Larry continues to provide data for this web page. Many of my references to other websites in this document were suggested to me by Larry.

Cyndi Rutledge is the administrator of the R1a Y-Haplogroup Project. Larry and Cyndi had been sending me M458 test results when that SNP was new. SNP results are now listed at project web pages.

Anatole Klyosov published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of publications. Some of the STR types that I independently discovered I later found as 25 marker modal haplotypes in Klyosov’s web documents (before his publication in JoGG - some in Russian). It was encouraging to me seeing independent identification of clusters by different methods. He emailed to me an English version of one of his 2008 publications. His Fall JoGG articles have references to his other publications. Here is a web link: Klyosov Home.

Kenneth Nordtvedt published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions, suggesting many STR based clusters.

FTDNA link: www.familytreedna.com. This is a commercial DNA testing company. I make extensive use of the project databases maintained by FTDNA. These are my primary sources of data. Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted for /polish/ in the following URL.

WTY. “Walk Through the Y”. This is a commercial product by FTDNA, for reading about 200,000 base pairs of your Y chromosome, in a search for new SNPs in your branch of the Y-DNA tree. Here is a direct link to a WTY description. You can read about my WTY at another of my web pages.

Polish Project link: www.familytreedna.com/public/polish. One of many FTDNA projects. This is my primary source for Polish data. The Polish Project tracks both Y-DNA and mtDNA; click on “Y-DNA Results” on the left to see the data that I use.

R1a Project link: http://www.familytreedna.com/public/R1a. Newer R1a project, with multiple co-administrators, active in subdividing R1a data into hypothetical haplogroups. The project home page has a summary chart of R1a SNP subdivision, and other reference links. Lapinski emails me correlations between my code names and the code names for this project; here is the Aug 2011 update.

Ysearch link: www.ysearch.org. Ysearch is the largest web database for Y-DNA, run by FTDNA, open to all men, including men who also register with projects and including men with data from other testing services. I use Ysearch often for analysis so of course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your data with Ysearch. Or you can type your Y-STR data into Ysearch. I am not associated with the company FTDNA. I have Instructions for comparing your STR data to my types (modal haplotypes) that I have entered into Ysearch.

Yhrd link: www.yhrd.org. A forensic Y-DNA database. Data is separated by city, with many Polish cities. I relied on Yhrd to figure out the geography of the various haplotypes. I wrote Yhrd Reminders for myself so that I won’t forget how to navigate the Yhrd web site; click on that link if you need some hints.

ISOGG link: http://isogg.org/tree/ Y-DNA tree with the most recent SNPs and corresponding alphanumeric codes.

recLOH: A technical detail discussed in many publications, for example http://en.wikipedia.org/wiki/RecLOH. I discuss this and other compound marker issues, and how step is calculated, in the “Documentation” sheet for my Calculator.xls tool.

DYS389: Another technical detail, also discussed on the web and in my Calculator.xls. Briefly, 389II is the sum of 389I plus another STR, so 389II should be figured in terms of the delta value.
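A small Python sketch of the DYS389 point follows. It assumes, only for illustration, that step is counted as the sum of absolute differences in repeat counts; the function names and the marker values are hypothetical, and the full treatment is in the Calculator.xls documentation.

```python
def dys389_delta(dys389i, dys389ii):
    """Repeat count of the second STR alone: 389II minus 389I."""
    return dys389ii - dys389i


def dys389_step(sample, modal):
    """Step contributed by DYS389, counting 389I and the delta separately.

    sample, modal: (389I, 389II) pairs of repeat counts.
    Assumes step is the sum of absolute differences (illustrative only).
    """
    step_i = abs(sample[0] - modal[0])
    step_delta = abs(dys389_delta(*sample) - dys389_delta(*modal))
    return step_i + step_delta


# Hypothetical values: one mutation in 389I shifts the reported 389II as well.
modal = (13, 30)
sample = (14, 31)
naive = abs(sample[0] - modal[0]) + abs(sample[1] - modal[1])
print(naive, dys389_step(sample, modal))
```

Comparing the raw 389II values counts the single 389I mutation twice; using the delta counts it once.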

Chandler mutation rates. Mentioned in my publication. From Chandler, Fall 2006 www.jogg.info, 37 markers. 67 marker extension on line at mutation rates.

I’m a very rare type in Poland - E1b1b1a2. My maternal 1st cousins are R1a1a. That means my late maternal grandfather was R1a1a. I became interested in Y-DNA in 2004. My maternal family name is Iwanowicz. I discovered a family with that name in my maternal grandfather’s home town in Poland. They are the only Iwanowicz family within 50 miles, so I suspected they might be my 3rd or 4th cousins. I brought a cheek swab kit when I visited them the second time in 2006. Sure enough, the son is a perfect 25 STR marker match to my 1st cousin. I didn’t get around to checking the web for a year. I was shocked to discover that these maternal cousins matched 80 people in the FTDNA database, for a perfect 12 out of 12 STR markers. That’s a hell of a lot of matches in the summer of 2007. Most of these matches are Polish. I did some research and found an article by Pawlowski (reference in my publication) about this most common Polish haplotype, which I now call P type. That got me interested in doing more research, leading to this web page for others to see my results. My experience, however, is a reminder that statistics can be misleading. I was confident that my grandfather’s haplotype was P type, based on a perfect match at the first 12 markers. I now (June 2010) figure that the probability was really about 93%, because 13 out of the 14 current Polish Project members who have 67 markers and who also match P type perfectly at 12 markers are in fact P type as judged by all 67 markers. My grandfather does not match P type at 67 markers. My grandfather is that 14th one. He matches the small hypothetical clade that I call I type, which is also concentrated in Poland. But my confidence in that I type assignment is only 80%, so maybe statistics is fooling me again. That’s how an outsider ended up studying P type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.

2012 Mar 17 Ysearch instructions updated to include Haplogroups I, N, R1b; very limited results

2012 Oct 20 update of both Abstracts, official drop of K and R categories, update of U, a few other minor updates

Polish Y-DNA Haplogroup Summary Table	Short overview of all major haplogroups in the Polish Project
Results Table	All proposed clades; many with Ysearch links
Explanation of the Results Table	Discussion of my methods
Ysearch Comparison	Instructions and links for comparison of your Ysearch data to these proposed clades
Haplotypes.xls	Table of STR Modal Haplotypes, Definitions, SBP, etc.