Reynold Xin

Source: Wikipedia, the free encyclopedia.
Alma mater: University of California, Berkeley
Field: Computer Science
Doctoral advisor: Michael J. Franklin

Reynold Xin is a computer scientist and co-founder of Databricks, known for his work on Apache Spark, an open source Big Data project.[2] He was the designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components, and he co-designed DataFrames, all of which are part of the core Apache Spark distribution. He also served as the release manager for Spark's 2.0 release.[3]

Biography

Berkeley

Xin started his work on the Spark open source project while he was a doctoral candidate at the AMPLab at the University of California, Berkeley. He received his Ph.D. in computer science from Berkeley, where his advisors were Michael J. Franklin and Ion Stoica.[4]

His first research project, Shark,[5] created a system that could efficiently execute SQL and advanced analytics workloads at scale. Shark won the Best Demo Award at SIGMOD 2012.[6] It was one of the first open source interactive SQL-on-Hadoop systems, with claims that it was between 10 and 100 times faster than Apache Hive. Shark was used by technology companies such as Yahoo,[7] although it was replaced by a newer system called Spark SQL in 2014.[8]

His second research project, GraphX,[9] created a graph processing system on top of Spark, a general data-parallel system. At the same time, GraphX challenged the notion that specialized systems are necessary for graph computation. GraphX was released as an open source project and merged into Spark in 2014 as Spark's graph processing library.
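The idea that graph computation can run on a general data-parallel system can be illustrated with a small sketch. The example below is not GraphX code and does not use its API. It is a plain Python stand-in, with hypothetical helper names (`reduce_by_key`, `pagerank`), that expresses PageRank using only table-style operations (map, join, and reduce by key) over an edge list. It assumes every vertex has at least one outgoing edge.

```python
from collections import defaultdict


def reduce_by_key(pairs):
    """Sum the values for each key, in the spirit of a data-parallel reduce-by-key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def pagerank(edges, iterations=20, damping=0.85):
    """Compute PageRank over an edge table [(src, dst), ...].

    Assumes every vertex has at least one outgoing edge; otherwise
    rank mass held by vertices without outgoing edges is not redistributed.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = reduce_by_key((src, 1.0) for src, _ in edges)
    ranks = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(iterations):
        # Join each edge with its source's rank, then map it to a
        # contribution addressed to the destination vertex.
        contributions = ((dst, ranks[src] / out_degree[src]) for src, dst in edges)
        # Aggregate the contributions per destination vertex.
        received = reduce_by_key(contributions)
        ranks = {
            v: (1 - damping) / len(vertices) + damping * received.get(v, 0.0)
            for v in vertices
        }
    return ranks


# Hypothetical four-edge graph used only for illustration.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
```

Each iteration is a join followed by an aggregation, which are the same kinds of operations a general data-parallel engine already provides for non-graph workloads.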

Databricks

In 2013, along with Matei Zaharia and other key Spark contributors, Xin co-founded Databricks, a venture-backed company based in San Francisco that offers a data platform as a service based on Spark.

In 2014, Xin led a team of Databricks engineers that competed in the Sort Benchmark and set the 2014 world record in Daytona GraySort using Spark, beating the previous record, held by Apache Hadoop, by a factor of 30.[10] Xin claimed that Spark was the fastest open source engine for sorting a petabyte of data.[11]

While at Databricks, he also started the DataFrames project,[12] Project Tungsten,[13] and Structured Streaming.[14] DataFrames has become Spark's foundational API, while Tungsten has become its new execution engine.

References

  1. ^ "Reynold Xin: Executive Profile & Biography - Businessweek". bloomberg.com. Bloomberg Businessweek. Retrieved 21 September 2016.
  2. ^ Woodie, Alex (8 June 2016). "Apache Spark Adoption by the Numbers". datanami.com. Tabor Communications. Retrieved 21 September 2016.
  3. ^ "Apache Spark Developers List - [ANNOUNCE] Announcing Apache Spark 2.0.0". apache-spark-developers-list.1001551.n3.nabble.com. Retrieved 4 August 2016.
  4. ^ "Speaker Reynold Xin". engsci.utoronto.ca. 5 October 2020.
  5. ^ S2CID 1597960.
  6. ^ "Shark Wins Best Demo Award at SIGMOD 2012". AMPLab - UC Berkeley. 24 May 2012. Retrieved 4 August 2016.
  7. ^ Tully. "Analytics on Spark & Shark @Yahoo" (PDF).
  8. ^ "Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark". 1 July 2014. Retrieved 4 August 2016.
  9. .
  10. ^ Finley, Klint. "Startup Crunches 100 Terabytes of Data in a Record 23 Minutes". Wired. Retrieved 4 August 2016.
  11. ^ "Apache Spark the fastest open source engine for sorting a petabyte". 10 October 2014. Retrieved 4 August 2016.
  12. ^ "Introducing DataFrames in Apache Spark for Large Scale Data Science". 17 February 2015. Retrieved 4 August 2016.
  13. ^ Woodie, Alex (4 May 2015). "Deep Dive Into Databricks' Big Speedup Plans for Apache Spark". datanami.com. Tabor Communications. Retrieved 21 September 2016.
  14. ^ Woodie, Alex (25 February 2016). "Spark 2.0 to Introduce New 'Structured Streaming' Engine". datanami.com. Tabor Communications. Retrieved 21 September 2016.