Wikipedia:Featured picture candidates/Anscombe's quartet

Source: Wikipedia, the free encyclopedia.

Anscombe's quartet

regression line and correlation coefficient. It illustrates the importance of exploring data graphically and the effect of outliers.
Alt 1 - no labels
Alt 2 - with labels, title added. The labels are misaligned, but this can be fixed if this approach gathers significant support.
Alt 3 - with subscripts for the x and y variables
Alt 4 - needs some corrections, but is a more explanatory (i.e. more encyclopaedic) way of presenting. Text is from Anscombe's original publication.
Reason
A nicely executed graph, with high EV. Well documented, with sources and source code provided.
The individual scatterplots are described and contrasted in our Correlation and dependence article.
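For anyone who wants to check the caption's claim about the regression line and correlation coefficient, here is a minimal Python sketch. It assumes the values as published in Anscombe's table (the same numbers ship with R as the `anscombe` dataset); the function name `fit_and_correlation` and the `QUARTET` and `RESULTS` names are made up for this illustration, and this is not the script used to draw the nominated image.

```python
from math import sqrt

# Anscombe's published values; datasets 1-3 share a single x column.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_and_correlation(xs, ys):
    """Return (intercept, slope, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return intercept, slope, r


RESULTS = {k: fit_and_correlation(xs, ys) for k, (xs, ys) in QUARTET.items()}

for k, (a, b, r) in RESULTS.items():
    print(f"dataset {k}: y = {a:.2f} + {b:.2f}x, r = {r:.3f}")
```

All four fits come out at roughly y = 3.00 + 0.50x with r of about 0.82, even though the scatterplots look nothing alike, which is the point the graph is meant to make.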
Articles in which this image appears
Correlation and dependence, Anscombe's quartet
Creator
R Development Core Team, Schutz (original version), Avenue (alts 1-3), Papa Lima Whiskey (alt 4)
  • I've struck my support here to indicate my preference below. -- Avenue (talk) 22:27, 28 March 2010 (UTC)[reply]
Could you be more specific about what you dislike about it, please? --Avenue (talk) 01:26, 22 March 2010 (UTC)[reply]
The subject's resolution isn't the best. The Utahraptor (talk) 01:34, 22 March 2010 (UTC)[reply]
It's an SVG. Resolution is not that great a deal since it can be scaled to any size. --Muhammad(talk) 01:37, 22 March 2010 (UTC)[reply]
OK. The Utahraptor (talk) 01:39, 22 March 2010 (UTC)[reply]
Just to expand on Muhammad's point, you can easily view it in higher resolution by clicking on the "2000px" link under the image shown on the file description page (or just click here). -- Avenue (talk) 05:49, 22 March 2010 (UTC)[reply]
If you are still opposing, Utahraptor, could you please say why? I think we have addressed the only concern you've raised. -- Avenue (talk) 05:49, 24 March 2010 (UTC)[reply]
  • Hmm. Given that these are not real measurements, but consist entirely of made up numbers, they have no meaning besides their appearance in this quartet. You could label them "First independent variable in Anscombe's quartet", "First dependent variable in Anscombe's quartet", and so forth, but (together with even the briefest caption) that would convey no more information than x1, y1 etc. I would have more sympathy for the argument that the axis labels and numbers are superfluous and should therefore be dropped. I'm happy to provide an alternative version along those lines, if anyone agrees. --Avenue (talk) 00:00, 24 March 2010 (UTC)[reply]
Are you suggesting that the text "Anscombe's quartet" be added within the image itself? Seems a bit pointless. The image has an image description (and filename, although that isn't as useful as it could be) that already provides this information. Labels within the image should really just be used to describe elements of the image, not the image itself IMO. Ðiliff «» (Talk) 14:41, 25 March 2010 (UTC)[reply]
Graphs should always be labelled. General rule of technical writing. Papa Lima Whiskey (talk) 19:51, 25 March 2010 (UTC)[reply]
Sure if it was a graph of something. Here we are looking at the shape of the graph itself, it isn't real data - you could label them example 1, example 2, etc., but I don't see what that gains you. If anything I'd go the other way as Avenue suggests and take out the axis labels. Kmusser (talk) 20:32, 25 March 2010 (UTC)[reply]
That's what we have captions for, IMO. --Avenue (talk) 21:08, 25 March 2010 (UTC)[reply]
  • Support per nom. Time3000 (talk) 10:19, 24 March 2010 (UTC)[reply]
  • Support per nom. Broccoli (talk) 15:54, 24 March 2010 (UTC)[reply]
  • Support per nom, preference for Alt 3. I'd also oppose adding extra labeling in the graphic itself, right now it is language independent which is a plus, if more explanation is needed then add it to the caption. Kmusser (talk) 16:00, 25 March 2010 (UTC)[reply]
  • Note: I've added two alternative versions along the lines suggested. Unfortunately Inkscape has shifted the axis labels a bit in the second one, but I think it illustrates the general idea. I'll fix it up if this gathers much support. -- Avenue (talk) 20:57, 25 March 2010 (UTC)[reply]
    • Surely the image has to at least contain the values for the axes? Otherwise we're left unsure if it's logarithmic or linear. Ðiliff «» (Talk) 21:37, 25 March 2010 (UTC)[reply]
      • Or some other scale, I suppose. It is ambiguous, yes, but I think any misinterpretation would almost have to be willful given the context in which it's displayed and the cues of the regression line and evenly spaced tickmarks. Personally I like the minimalist version best, but I think the original is probably better suited to a general audience. I can add another alternative, with values along the axes but no axis labels, if you would like. --Avenue (talk) 22:31, 25 March 2010 (UTC)[reply]
  • Oppose. Bogus license. Images are not considered derivative works of the software used to produce them, i.e. images are not software. Please see my comments below in the licensing discussion. (The GPL is not an appropriate license for images anyway, as meeting its conditions requires measures that are impractical for images.) Kaldari (talk) 21:09, 25 March 2010 (UTC)[reply]
  • The same could be said for the GFDL, I think, yet we happily use that for images. --Avenue (talk) 12:33, 31 March 2010 (UTC)[reply]
  • Oppose all [except Alt 3] due to error: While x1, y1, x2, y2 and so on are frequently seen, this is solely because most graphing programs do not handle subscripts well. The CORRECT way of writing them is x₁, y₁, x₂, y₂ and so on. As it is, if anything, the labels on the first can all too easily be read as (for instance) y² - and y squared would give a VERY different interpretation to the graph. If the original (which is easily the best, as that font can be clearly read in the article, which is not true for any other) can be brought into the proper mathematical convention, I Conditionally support it. Also, I should note that I Strong Oppose Alt 2: While unreadable text at thumbnail size is not normally an issue at FPC, this is an obvious exception: It loses most of its value in the articles if the reader has to click through just to find out which graph is being discussed when. Further, the positioning of the labels is uneven, which is sloppy. y1 is far closer to the graph than x1. I'd go so far as to say y1 is uncomfortably close. We can do better than this. Shoemaker's Holiday talk 14:25, 26 March 2010 (UTC)[reply]
Yes, I warned above that the labelling in Alt 2 was messy, and offered to fix it if necessary. I only provided it as an example to see if others liked PLW's proposed approach. I've now added a note explaining this to its caption.
I've also added another alternative version, with subscripts. -- Avenue (talk) 17:21, 26 March 2010 (UTC)[reply]
I don't oppose the new version, but notation is not nearly as clear-cut as you make it out to be. For example, y² doesn't always mean y squared. I wouldn't call it an error or wrong as it is. (talk) 22:50, 26 March 2010 (UTC)[reply]
It's hardly common for it to mean anything else, and subscript notation is by far the standard in mathematics, of which statistics is a branch.
I agree that the subscripts are better in the two contexts where it is used here. It might be different if we were using it in a context where the variable names used in the statistical software were relevant, e.g. as an example of R's graphical capabilities. --Avenue (talk) 22:21, 28 March 2010 (UTC)[reply]
We could create similar datasets to get around this issue if necessary, although this would not be entirely trivial. The graph would then lose some historical value, but it would still have good encyclopedic value in our Correlation and dependence article. -- Avenue (talk) 09:09, 27 March 2010 (UTC)[reply]
Since the data presented by Anscombe is only one part of the overall publication, I would argue that use of the data constitutes quotation of an excerpt, which is permitted. Furthermore, providing data for re-analysis is a basic courtesy if not part of professional codes of conduct for pro bono research results. The American Statistical Society, in whose journal the work was published, is, to the best of my knowledge, a public charity; Anscombe was employed by Yale (charity? Harvard is...), and the work supported by public funds via the Office of Naval Research. I think we're on pretty safe ground using the data. Papa Lima Whiskey (talk) 11:40, 27 March 2010 (UTC)[reply]
That's a good fair use justification, but featured pictures should be under a free license. -- Avenue (talk) 21:38, 27 March 2010 (UTC)[reply]
Yes, without a free license, we can't feature any of these. Papa Lima Whiskey (talk) 12:10, 28 March 2010 (UTC)[reply]
Thanks to Noodle snacks for letting me know about this issue. Here is my take on the licensing questions:
First, the question of the GPL: the image is not licensed under the GPL just because I used R to generate it; the reasoning is more complex. The script used to generate the image comes from an example script provided with R, and I have made only small modifications to it. The image is thus a derived work of this little piece of R code, which is under the GPL. The description of the image indicates this, although it used to be clearer before some of the latest modifications removed useful information; I'll correct this ASAP. I would much prefer to licence it under a CC-BY-SA licence, but I think the reasoning here is correct.
About a possible copyright on the data: the data is presented in the original paper as a table of 66 meaningless numbers; it seems unlikely to me to qualify as a "work" in itself. Or, in other words, it looks difficult to me to separate the idea (not protected by copyright) behind the dataset from its actual expression (the numbers): as mentioned above, we could easily create a similar dataset with different numbers (so the expression would be different), but the resulting graph would still be the same (no individuality). All in all, I don't see a copyrighted work here.
Cheers, Schutz (talk) 19:29, 28 March 2010 (UTC)[reply]
Thanks for weighing in (and for producing the original graph!) I had been assuming that Anscombe's original paper would have included graphs, so if it didn't, that puts a different complexion on things, and I think using some sort of free license for our graphs is justifiable.
Actually, the paper contains the graphs, sorry if I wasn't clear. But the data itself is just the bunch of numbers in a table. The layout of the R graph is quite different from the one from the paper (as different as can be, given that there is only one basic way to do a scatterplot). Schutz (talk) 07:11, 29 March 2010 (UTC)[reply]
I've made various changes to the code to produce the Alt 3 version; for example, I saw no need for two separate loops. But my code is still a derived work and should be under the GPL too (when I finally get around to uploading it). I've now reviewed the example provided with R (available e.g. here), and I think the R project should be acknowledged as a creator here, so I'm modifying my nomination statement above accordingly (and to reflect the various alt versions). -- Avenue (talk) 23:22, 28 March 2010 (UTC)[reply]
A couple points that need to be clarified:
  1. Images are not considered derivative works of the software used to produce them. The copyright in images lies in the expression of the data, not the data itself or the algorithms or programs used to produce an expression of that data. In other words, only people are granted copyrights, not programs. As the producer of the graphs and sole copyright holder of the image, Schutz can choose any license he wishes irrespective of the license of any software used to assist him.
  2. In the United States at least, data cannot be copyrighted, only unique expressions of that data.
Kaldari (talk) 03:46, 29 March 2010 (UTC)[reply]
No, I think Schutz is not the sole copyright holder of his image, because it is a close derivative of the example produced by the R developers. The dots have been enlarged, an overall title removed, and different tickmarks placed on the x-axes, but the colour scheme and layout are identical. --Avenue (talk) 04:35, 29 March 2010 (UTC)[reply]
Yes, the (unknown) person who wrote the original script is an author as well. I think this is made clear in the description of the image (and I have improved it even more yesterday); let me know if this is not the case. Schutz (talk) 07:11, 29 March 2010 (UTC)[reply]
Yes; sorry if I have implied at all that you claimed more than due credit. The fault in initially not giving the R people credit in this nomination is mine alone. I should have checked how closely your graphic was based on their work at the outset. --Avenue (talk) 08:01, 29 March 2010 (UTC)[reply]
Sorry guys, but you're still not getting it right. Programming examples given in teaching materials may be freely used - check the SCO lawsuit, where it was made clear that examples that Kernighan and Ritchie had used in textbooks had thereby essentially entered the public domain, as the purpose of the textbook was teaching, i.e. the intention in publishing the tutorial is to allow others to use it. Hence it is assumed that copyright on the code is relinquished at that point, and the tutee is free to use what he has learnt from the tutorial, without any need for additional variation or obfuscation. I know of no legislation or case law that treats online tutorials differently in this respect, so it should be assumed that the GPL does not piggyback on the teaching materials provided. (However, any part of the tutorial that isn't code may still have copyright attached to it! That said, I see no text or illustration explaining the logic behind the code included in the image. The image is just a presentation of Anscombe's data.) Papa Lima Whiskey (talk) 13:41, 29 March 2010 (UTC)[reply]
Even though I would prefer it this way, I am not sure about your reasoning (if it is correct, I would love having a page on Commons describing this, with pointers to the relevant references, as a reference: there is plenty of interesting material we could reuse in this way). But I am not in a position to comment on it at the moment; the best I can do is update the description page of the image to confirm that all my modifications are under CC-BY-SA; this way, if anyone thinks this image could be relicensed, there is nothing in the way of doing so. Schutz (talk) 15:37, 29 March 2010 (UTC)[reply]
I'll add CC and GFDL licensing for my contributions too. I still think we have a moral obligation to give the R people a share of the credit for the graphic, though. --Avenue (talk) 16:06, 29 March 2010 (UTC)[reply]
According to the CONTU rule, Schutz and only Schutz is the author of the image (by US Copyright law at least): "The question has been raised whether authorship or proprietorship of the program or data base establishes or may establish a claim of authorship of the final work. It appears to the Commission that authorship of the program or of the input data is entirely separate from authorship of the final work..."[1] In addition, section 102b of the US Copyright Act specifies that a work must be "fixed in a tangible medium of expression" in order to be copyrighted. Thus a computer program can be copyrighted, but the output of that program is not copyrighted by the author of the program unless they themselves create the output. Of course many legal scholars consider the CONTU rule to be outdated and suggest that in many situations it should be possible for program authors to successfully claim full or joint authorship of a work generated by their programs. No updated rule has been created, however, to adequately deal with this issue. If Schutz or the authors of R live outside of the United States, the situation is even less clear. Regardless, I personally don't believe that the GPL is an appropriate license for an image, so I'm going to have to oppose for now. I would recommend contacting the authors of R and asking them to explicitly renounce any authorship in the work. Kaldari (talk) 17:50, 29 March 2010 (UTC)[reply]
I think it would be easier to simply redraw the graphics from scratch without reference to the example code. I may not have time before this closes, though. --Avenue (talk) 12:33, 31 March 2010 (UTC)[reply]
And for what it's worth, many of the R core team are not based in the US. R was initially developed by two people at the University of Auckland in New Zealand, but has grown a lot since then. --Avenue (talk) 12:38, 31 March 2010 (UTC)[reply]
  • Support alt 1 since the units are arbitrary. HereToHelp (talk to me) 19:40, 27 March 2010 (UTC)[reply]
    • As mentioned above though, as far as I can see the units are not arbitrary for two reasons: Firstly, it is not self-evident that the scale is linear (only vaguely implied by the evenly spaced tickmarks and the straight regression line) without values. Secondly, the values are of historical importance. The theory may be equally valid with other equivalent values, but Anscombe specified these values. Ðiliff «» (Talk) 19:59, 27 March 2010 (UTC)[reply]
  • Alt added Alt 4 explains the graphs in Anscombe's own words. I know that it will not be possible to give full consideration to alt 4 before this nomination closes, but I'm adding it just so that whoever has more experience with SVG can fix the remaining problems (apparently caused by Inkscape). Thank you. Papa Lima Whiskey (talk) 12:13, 28 March 2010 (UTC)[reply]
    • Seems a bit messy. Different fonts used, text pushing up against the margins of the graphs and merging with the values, convoluted image description etc. Also, is there a web reference for the text? Nowhere on the image page does it specify that the text is Anscombe's own words. Ðiliff «» (Talk) 12:33, 28 March 2010 (UTC)[reply]
        • Dunno why you feel the need to bash the shortcomings when they've already been acknowledged. As for the reference, it's in the article, and you can't get it without a JSTOR subscription. Papa Lima Whiskey (talk) 15:00, 28 March 2010 (UTC)[reply]
          • I wasn't 'bashing' the shortcomings, I just listed them so that they could be acknowledged and improved on if possible. I don't think you really did acknowledge the problems I mentioned at all. You mentioned problems but did not specify what they were and I guess I misunderstood as a result of that vagueness. Ðiliff «» (Talk) 16:05, 28 March 2010 (UTC)[reply]
      • Even if you get past the typefaces, it just seems messy. It's hard to read in a thumbnail. I think the point Anscombe is trying to make is plenty clear without making us squint at text that should be in the caption. HereToHelp (talk to me) 14:07, 28 March 2010 (UTC)[reply]
  • Support Alt 3, well executed and excellent EV. Modest Genius talk 15:05, 28 March 2010 (UTC)[reply]
  • Support Alt 3. Most appropriate of the three IMO. Ðiliff «» (Talk) 16:14, 28 March 2010 (UTC)[reply]
  • Support Alt 3 - Sorts my concerns. Shoemaker's Holiday talk 17:42, 28 March 2010 (UTC)[reply]
  • Support Alt 1 or Alt 3.--Avenue (talk) 22:27, 28 March 2010 (UTC)[reply]
  • Comments: as the uploader, I won't vote on the image. Just two comments:
    • The actual values on the axes are pretty important, since they show that the data represented on the four plots is on the same scale (e.g. the x values for plots 1-3 are exactly the same); the dataset is important not only because of the pretty graph, but because the averages, standard deviations, etc, across graphs are the same; the values are required to convey part of this information.
    • The use of a subscript for the variable names is definitely the best way to label them; as such, I have modified the image (and cleaned up the description at the same time), so that the original image and alt 3 should be more or less the same.
    Cheers, Schutz (talk) 15:37, 29 March 2010 (UTC)[reply]
  • I don't have a strong view on this myself, and will quickly note that Anscombe did not use subscripts. Papa Lima Whiskey (talk) 17:21, 29 March 2010 (UTC)[reply]
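The point made above about the axis values (plots 1-3 share their x values, and the averages and standard deviations match across all four plots) can be checked with a short Python sketch. It assumes the values as published in Anscombe's table, uses only the standard library, and the names `QUARTET` and `SUMMARY` are made up for this illustration.

```python
from statistics import mean, stdev

# Anscombe's published values; datasets 1-3 share a single x column.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Per dataset: (mean of x, sample sd of x, mean of y, sample sd of y).
SUMMARY = {
    k: (mean(xs), stdev(xs), mean(ys), stdev(ys))
    for k, (xs, ys) in QUARTET.items()
}

for k, (mx, sx, my, sy) in SUMMARY.items():
    xs, _ = QUARTET[k]
    print(
        f"dataset {k}: x range {min(xs)}-{max(xs)}, "
        f"mean x = {mx:.2f}, sd x = {sx:.2f}, "
        f"mean y = {my:.2f}, sd y = {sy:.2f}"
    )
```

The summary figures agree to about two decimal places across the four datasets, while the x range of the fourth plot differs from the other three; that difference is only visible if the axis values are kept.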

Promoted File:Anscombe's quartet 3.svg --Makeemlighter (talk) 03:33, 1 April 2010 (UTC)[reply]