Exploratory data analysis

Source: Wikipedia, the free encyclopedia.

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA),[1][2] which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, and on handling missing values and making transformations of variables as needed. EDA encompasses IDA.

Overview

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."[3]

Exploratory data analysis is a technique for investigating a data set and summarizing its main characteristics, and a principal advantage of EDA is the visualization of the data it produces in the course of the analysis.

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs.[4] The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles. The median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than those traditional summaries. The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
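The resampling idea behind the bootstrap is simple to sketch. The following minimal illustration (not drawn from the sources above; it uses only Python's standard library and synthetic data) estimates the standard error of the median, a statistic with no simple closed-form formula:

```python
import random
import statistics

random.seed(0)

# Synthetic, right-skewed sample (illustrative only).
data = [random.expovariate(1.0) for _ in range(200)]

def bootstrap_se(sample, stat, n_resamples=1000):
    """Estimate the standard error of `stat` by resampling with replacement."""
    estimates = [
        stat(random.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)

se_median = bootstrap_se(data, statistics.median)
print(f"bootstrap SE of the median: {se_median:.3f}")
```

Because the resampling distribution is built from the data themselves, the same function works for the median, trimmed means, or any other statistic, which is what makes the method nonparametric.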

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.[5]

Development

Data science process flowchart

John W. Tukey wrote the book Exploratory Data Analysis in 1977.[6] Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

The objectives of EDA are to:

  • Enable unexpected discoveries in the data
  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments[7]

Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking.[8]

Techniques and tools

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.[9]

Typical graphical techniques used in EDA are:

  • Box plot
  • Histogram
  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Scatter plot
  • Stem-and-leaf plot
  • Parallel coordinates
  • Odds ratio
  • Targeted projection pursuit
  • Heat map
  • Bar chart
  • Interactive versions of these plots

Dimensionality reduction:

  • Multidimensional scaling
  • Principal component analysis (PCA)
  • Multilinear PCA
  • Nonlinear dimensionality reduction (NLDR)

Typical quantitative techniques are:

  • Median polish
  • Trimean
  • Ordination
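As a small numeric illustration of these quantitative summaries (an invented sample, standard-library Python only): Tukey's trimean weights the quartiles and the median, so a single extreme value moves it far less than the mean:

```python
import statistics

# Illustrative sample with one extreme value (invented data).
data = [2, 3, 3, 4, 5, 5, 6, 7, 8, 40]

# Quartiles via the statistics module (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)

# Tukey's trimean: a robust measure of location.
trimean = (q1 + 2 * median + q3) / 4

print(f"mean    = {statistics.mean(data):.2f}")  # pulled up by the outlier
print(f"median  = {median:.2f}")
print(f"trimean = {trimean:.2f}")
```

Here the outlier drags the mean well above every other observation, while the median and trimean stay near the bulk of the data, which is exactly the robustness property discussed above.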

History

Many EDA ideas can be traced back to earlier authors, for example:

  • Francis Galton emphasized order statistics and quantiles.
  • Arthur Lyon Bowley used precursors of the stemplot and five-number summary; Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median.[11]
  • Andrew Ehrenberg articulated a philosophy of data reduction.

The Open University course Statistics in Society (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.

Example

Findings from EDA are orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will give to the waiter.[12] The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is

(tip rate) = 0.18 - 0.01 × (party size)

which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by one percentage point (0.01), on average.

However, exploring the data reveals other interesting features not described by this model.

  • Histogram of tip amounts where the bins cover $1 increments. The distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities.
  • Histogram of tip amounts where the bins cover $0.10 increments. An interesting phenomenon is visible: peaks occur at the whole-dollar and half-dollar amounts, which is caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, like gasoline.
  • Scatterplot of tips vs. bill. Points below the line correspond to tips that are lower than expected (for that bill amount), and points above the line are higher than expected. We might expect to see a tight, positive linear association, but instead see variation that increases with tip amount. In particular, there are more points far away from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous.
  • Scatterplot of tips vs. bill separated by payer gender and smoking section status. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample).

What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data.
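The bin-width effect in the histograms above is easy to reproduce. In this sketch (entirely synthetic data, invented for illustration) most "customers" round their tip to the nearest half dollar; $1 bins hide that structure, while $0.10 bins expose spikes at the round amounts:

```python
import random
from collections import Counter

random.seed(1)

# Synthetic tips: 70% of customers round to the nearest $0.50 (invented behavior).
tips = []
for _ in range(500):
    raw = random.uniform(1.0, 5.0)
    tips.append(round(raw * 2) / 2 if random.random() < 0.7 else round(raw, 2))

def histogram(values, width):
    """Count values per bin; work in integer cents to avoid float-boundary errors."""
    width_cents = round(width * 100)
    return Counter(round(v * 100) // width_cents for v in values)

coarse = histogram(tips, 1.00)  # $1 bins: rounding is invisible
fine = histogram(tips, 0.10)    # $0.10 bins: spikes at whole/half dollars

# The bin starting at $2.00 holds far more tips than the neighboring bin at $2.10.
print(fine[20], fine[21])
```

With the coarse bins each dollar-wide count mixes round and non-round tips, so the distribution looks smooth; only the finer binning reveals the rounding behavior, mirroring the two histograms described above.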

Software

  • JMP, an EDA package from SAS Institute.
  • KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
  • Minitab, an EDA and general statistics package widely used in industrial and corporate settings.
  • Orange, an open-source data mining and machine learning software suite.
  • Python, an open-source programming language widely used in data mining and machine learning.
  • R, an open-source programming language for statistical computing and graphics. Together with Python one of the most popular languages for data science.
  • TinkerPlots, an EDA software package for upper elementary and middle school students.
  • Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.

References

  1. .
  2. .
  3. ^ John Tukey, "The Future of Data Analysis", July 1961.
  4. ^ Becker, Richard A., A Brief History of S, Murray Hill, New Jersey: AT&T Bell Laboratories, archived from the original (PS) on 2015-07-23, retrieved 2015-07-23, ... we wanted to be able to interact with our data, using Exploratory Data Analysis (Tukey, 1971) techniques.
  5. .
  6. .
  7. ^ Behrens, "Principles and Procedures of Exploratory Data Analysis", American Psychological Association, 1997.
  8. .
  9. .
  10. .
  11. ^ Elementary Manual of Statistics (3rd edn., 1920), https://archive.org/details/cu31924013702968/page/n5
  12. ^ Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007) "Interactive and Dynamic Graphics for Data Analysis: With R and GGobi", Springer, ISBN 978-0387717616.
