~ Project ~


Below are the recommended topics and some related resources/papers.


1. Scheduling and Resource Allocation in Big Data Systems

  • R. Grandl et al., "Altruistic Scheduling in Multi-Resource Clusters", OSDI 2016.
  • R. Grandl et al., "Packing and Dependency-Aware Scheduling for Data-Parallel Clusters", OSDI 2016.
  • I. Gog et al., "Firmament: Fast, Centralized Cluster Scheduling at Scale", OSDI 2016.
  • J. Jiang et al., "Symbiosis: Network-Aware Task Scheduling in Data-Parallel Frameworks", Infocom 2016.
  • P. Delgado et al., "Job-Aware Scheduling in Eagle: Divide and Stick to Your Probes," ACM SoCC 2016.
  • Y. Yang et al., "TR-Spark: Transient Computing for Big Data Analytics," ACM SoCC 2016.

2. Wide-area/Geo-distributed Big Data Analytics

  • K. Kloudas et al., "Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics", VLDB 2015.
  • Vulimiri et al., "Global Analytics in the Face of Bandwidth and Regulatory Constraints", NSDI, 2015.
  • Vulimiri et al., "WANalytics: Analytics for a geo-distributed data-intensive world", CIDR, 2015.
  • Pu et al., "Low-Latency Analytics of Geo-Distributed Data in the Wide Area", SIGCOMM, 2015.
  • Viswanathan et al., "Clarinet: WAN-Aware Optimization for Analytics Queries", OSDI, 2016.

3. Distributed Systems for Deep Learning

4. Distributed Machine Learning Platforms

  • F. Niu et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011.
  • J. Dean et al., "Large scale distributed deep networks," NIPS 2012.
  • Li et al., "Scaling Distributed Machine Learning with the Parameter Server", OSDI, 2014.
  • Li et al., "Communication Efficient Distributed Machine Learning with the Parameter Server", NIPS 2014.
  • Eric P. Xing et al., "Petuum: A new platform for Distributed Machine Learning on Big Data", IEEE Transactions on Big Data, 2015.
  • T. Chen et al., "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems", NIPS Workshop on Machine Learning Systems (LearnSys), 2015.
  • MXNet: Flexible and Efficient Library for Deep Learning, https://github.com/dmlc/mxnet
  • Gonzalez J. E. et al., "Asynchronous Complex Analytics in a Distributed Dataflow Architecture", arXiv preprint arXiv:1510.07092 (2015).
  • X. Pan et al., "Cyclades: Conflict-free Asynchronous Machine Learning," NIPS 2016.

5. Stateful Dataflow

  • D. Murray, "Incremental, iterative data processing with timely dataflow", Communications of the ACM, 2016.
  • Murray et al., "Naiad: A Timely Dataflow System", SOSP, 2013.
  • P. Pietzuch et al., "Stateful Distributed Dataflow Graphs",
  • R. C. Fernandez et al., "Making state explicit for imperative big data processing". In USENIX ATC, 2014.
  • R. C. Fernandez et al., "Integrating scale out and fault tolerance in stream processing using operator state management". ACM SIGMOD 2013.

6. Stream Analytics

  • Lin et al., "StreamScope: Continuous Reliable Distributed Processing of Big Data Streams", NSDI, 2016.
  • Kulkarni et al., "Twitter Heron: Stream Processing at Scale", SIGMOD, 2015.
  • Toshniwal et al., "Storm @Twitter", SIGMOD, 2014.
  • Rabkin et al., "Aggregation and Degradation in JetStream: Streaming analytics in the wide area", NSDI, 2014.
  • T. Condie et al., "MapReduce Online", NSDI 2010.
  • A. Alexandrov et al., "The Stratosphere platform for Big Data Analytics", VLDB 2014.
  • R. C. Fernandez et al., "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management". In SIGMOD, 2013.

7. SQL-based Big Data Systems

  • A. Floratou et al., "SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures", VLDB 2014.
  • Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop", CIDR 2015.
  • Huai et al., "Major technical advancements in Apache Hive", SIGMOD, 2014.
  • Armbrust et al., "Spark SQL: Relational Data Processing in Spark", SIGMOD, 2015.
  • L. Chang, "Presto: Interacting with petabytes of data at Facebook", blog post, 2013.
  • W. Alkowaileet et al., "Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark", VLDB 2016.
  • A. Alexandrov et al., "Emma in Action: Declarative Dataflows for Scalable Data Analysis", SIGMOD 2016.

8. Systems for Big Graph Analytics

  • Carlos H.C. Teixeira et al., "Arabesque: A system for distributed graph mining", SOSP 2015.
  • Amitabha Roy et al., "Chaos: Scale-out Graph Processing from Secondary Storage", SOSP 2015.
  • Aneesh Sharma et al., "GraphJet: Real-Time Content Recommendations at Twitter", VLDB 2016.
  • D. Yan et al., "Big Graph Analytics Systems", SIGMOD 2016
  • D. Yan et al., "A General-Purpose Query-Centric Framework for Querying Big Graphs", VLDB 2016.

9. Approximate Query Processing

  • S. Agarwal et al., "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", EuroSys, 2013.
  • S. Agarwal et al., "Knowing when you're wrong: building fast and reliable approximate query processing systems," SIGMOD 2014.
  • R. Agarwal et al., "Succinct: Enabling Queries on Compressed Data", NSDI, 2015.
  • G. Ananthanarayanan et al., "GRASS: trimming stragglers in approximation analytics," NSDI 2014.

10. Monitoring and Diagnosis in Data-Center-Scale Computing

  • Mace et al., "Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems", SOSP, 2015.
  • M. Moshref et al., "Trumpet: Timely and Precise Triggers in Data Centers", SIGCOMM 2016.
  • R. Sambasivan et al., "Principled workflow-centric tracing of distributed systems," ACM SoCC 2016.
  • M. Leich, "Runtime Analysis of Distributed Data Processing Programs," VLDB 2014.
  • E. Coppa et al., "On Data Skewness, Stragglers and MapReduce Progress Indicators," ACM SoCC 2015.

11. Matrix Computations on Distributed Clusters

  • R. B. Zadeh et al., "Matrix Computations and Optimization in Apache Spark", KDD 2016.
  • A. Elgohary et al., "Compressed Linear Algebra for Large-scale Machine Learning", VLDB 2016.
  • M. Li et al., "Cuckoo Linear Algebra", KDD 2015.

12. Management for Data-Center Networks

  • A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network," SIGCOMM 2015.
  • R. Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure", SIGCOMM 2016.
  • Y. W. Sung et al., "Robotron: Top-down Network Management at Facebook Scale", SIGCOMM 2016.
  • M. Chow et al., "The Mystery Machine: End-to-end performance analysis of large-scale Internet services," OSDI 2014.

13. Traffic Flow Scheduling for Data Center Networks

  • M. Chowdhury, I. Stoica, "Managing data transfers in computer clusters with Orchestra", SIGCOMM 2011.
  • M. Chowdhury, I. Stoica, "Efficient coflow scheduling with Varys", SIGCOMM 2014.
  • F.R. Dogar et al., "Decentralized Task-Aware Scheduling for Data Center Networks", SIGCOMM 2014.
  • M. Chowdhury, I. Stoica, "Efficient Coflow Scheduling Without Prior Knowledge". SIGCOMM 2015.
  • Y. Zhao et al., "RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks", Infocom 2016.
  • Y. Li et al., "Efficient online coflow routing and scheduling", ACM Mobihoc 2016.
  • H. Zhang et al., "CODA: Toward Automatically Identifying and Scheduling COflows in the DArk", ACM SIGCOMM 2016.
  • L. Chen et al., "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016.
  • P. Wang et al., "Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks," ACM SoCC 2016.

14. Performance Prediction for Large-scale Analytics

  • S. Venkataraman et al., "Ernest: Efficient Performance Prediction for Large-scale Advanced Analytics", NSDI 2016.
  • K. Ousterhout, "Re-architecting Spark for Performance Understandability", Spark Summit 2016 talk
  • K. Ousterhout et al., "Making Sense of Performance in Data Analytics Frameworks", NSDI 2015.
  • D. Crankshaw et al., "The missing piece in complex analytics: low latency, scalable model management and serving with Velox", CIDR 2015.
  • N. J. Yadwadkar et al., "Wrangler: Predictable and Faster Jobs using Fewer Resources", ACM SoCC 2014.
  • E. Sparks et al., "Automating model search for large-scale machine learning," ACM SoCC 2015.
  • K. Rajan et al., "PerfOrator: eloquent performance models for Resource Optimization," ACM SoCC 2016.
  • N. J. Yadwadkar et al., "Katz: Faster Jobs in Distributed Data Processing using Multi-Task Learning", SDM 2015.

15. VM/Cloud Resource Management/Scheduling

  • W. Lang et al., "Not for the Timid: On the impact of Aggressive Over-booking in the Cloud", VLDB 2016.
  • C. Fuerst et al., "Kraken: Online and Elastic Resource Reservations for Multi-tenant Datacenters", Infocom 2016.
  • Z. Han et al., "Dynamic Virtual Machine Management via Approximate Markov Decision Process", Infocom 2016.
  • J. Mace et al., "2DFQ: Two-dimensional Fair Queueing for Multi-Tenant Cloud Services", SIGCOMM 2016.
  • J. Ghaderi, "Randomized Algorithms for Scheduling VMs in the Cloud", Infocom 2016.

16. Analyzing Dynamic/Time-evolving Graphs in Large-Computing Clusters

  • I. Stoica, "Time-evolving Graph Processing on Commodity Clusters", Spark Summit 2016.
  • A. P. Iyer et al. "Time-evolving graph processing at scale." Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems. ACM, 2016.
  • Z. Y. Dong, "A Framework for Computing on Large Dynamic Graphs", arXiv preprint arXiv:1512.01668 (2015).

17. Accelerator/GPU Spark Integration

  • Di Wu et al., "Deploying Accelerators at DataCenter Scale using Spark," Spark Summit 2016.
  • Y. Hu et al., "GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale," Spark Summit 2016.
  • Y.T. Chen et al., "Apache Spark Meets FPGAs: A case study for Next Generation DNA Sequencing Acceleration," HotCloud 2016.
  • M. Huang et al., "Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale", ACM SoCC 2016.

18. Systems for In-memory Big Data Management and Processing

  • H. Zhang et al., "In-memory big data management and processing: A survey", IEEE Transactions on Knowledge and Data Engineering 27.7 (2015): 1920-1948.

19. Machine Learning API/Toolkits for Large-scale Clusters

  • Kraska et al., "MLbase: A Distributed Machine-learning System", CIDR, 2013.
  • Sparks et al., "MLI: An API for Distributed Machine Learning", ICDM, 2013.
  • M. Boehm et al., "SystemML: Declarative Machine Learning on Spark", VLDB 2016.
  • Microsoft Distributed Machine Learning Toolkit (DMTK) http://www.dmtk.io/index.html.