~ Project ~
Below are the recommended topics and some related resources/papers.
1. Scheduling and Resource Allocation in Big Data Systems
- R. Grandl et al., "Altruistic Scheduling in Multi-Resource Clusters", OSDI 2016.
- R. Grandl et al., "Packing and Dependency-Aware Scheduling for Data-Parallel Clusters", OSDI 2016.
- I. Gog et al., "Firmament: Fast, Centralized Cluster Scheduling at Scale", OSDI 2016.
- J. Jiang et al., "Symbiosis: Network-Aware Task Scheduling in Data-Parallel Frameworks", Infocom 2016.
- P. Delgado et al., "Job-Aware Scheduling in Eagle: Divide and Stick to Your Probes," ACM SoCC 2016.
- Y. Yan et al., "TR-Spark: Transient Computing for Big Data Analytics," ACM SoCC 2016.
2. Wide-area/Geo-distributed Big Data Analytics
- K. Kloudas et al., "Pixida: Optimizing Data Parallel Jobs in Wide-Area Data Analytics", VLDB 2015.
- Vulimiri et al., "Global Analytics in the Face of Bandwidth and Regulatory Constraints", NSDI, 2015.
- Vulimiri et al., "WANalytics: Analytics for a geo-distributed data-intensive world", CIDR, 2015.
- Pu et al., "Low-Latency Analytics of Geo-Distributed Data in the Wide Area", SIGCOMM, 2015.
- Viswanathan et al., "Clarinet: WAN-Aware Optimization for Analytics Queries", OSDI, 2016.
3. Distributed Systems for Deep Learning
- Chilimbi et al., "Project Adam: Building an Efficient and Scalable Deep Learning Training System", OSDI, 2014.
- Martin Abadi et al., "TensorFlow: A System for Large-Scale Machine Learning", OSDI 2016.
- Tim Hunter, "TensorFrames -- Google TensorFlow on Apache Spark", Spark Meetup, June 2016.
- CaffeOnSpark: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep.
- Distributed (Deep) Machine Learning Community (DMLC), https://github.com/dmlc
- MXNet: Flexible and Efficient Library for Deep Learning, https://github.com/dmlc/mxnet
- Deeplearning4J.org, "Running Deep Learning on Distributed GPUs with Spark," http://deeplearning4j.org/spark-gpus
- Alex Chen et al., "Distributed Neural Networks with GPUs in the AWS Cloud", http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html.
4. Distributed Machine Learning Platforms
- F. Niu et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," NIPS 2011.
- J. Dean et al., "Large scale distributed deep networks," NIPS 2012.
- Li et al., "Scaling Distributed Machine Learning with the Parameter Server", OSDI, 2014.
- Li et al., "Communication Efficient Distributed Machine Learning with the Parameter Server", NIPS 2014.
- Eric P. Xing et al., "Petuum: A new platform for Distributed Machine Learning on Big Data", IEEE Transactions on Big Data, 2015.
- T. Chen et al., "MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems", NIPS Workshop on Machine Learning Systems (LearnSys), 2015.
- MXNet: Flexible and Efficient Library for Deep Learning, https://github.com/dmlc/mxnet
- Gonzalez J. E. et al., "Asynchronous Complex Analytics in a Distributed Dataflow Architecture", arXiv preprint arXiv:1510.07092 (2015).
- X. Pan et al., "Cyclades: Conflict-free Asynchronous Machine Learning," NIPS 2016.
5. Stateful Dataflow
- D. Murray, "Incremental, iterative data processing with timely dataflow", Communications of the ACM 2016.
- Murray et al., "Naiad: A Timely Dataflow System", SOSP, 2013.
- P. Pietzuch et al., "Stateful Distributed Dataflow Graphs",
- R. C. Fernandez et al., "Making state explicit for imperative big data processing". In USENIX ATC, 2014.
- R. C. Fernandez et al., "Integrating scale out and fault tolerance in stream processing using operator state management". ACM SIGMOD 2013.
6. Stream Analytics
- Lin et al., "StreamScope: Continuous Reliable Distributed Processing of Big Data Streams", NSDI, 2016.
- Kulkarni et al., "Twitter Heron: Stream Processing at Scale", SIGMOD, 2015.
- Toshniwal et al., "Storm @Twitter", SIGMOD, 2014.
- Rabkin et al., "Aggregation and Degradation in JetStream: Streaming analytics in the wide area", NSDI, 2014.
- T. Condie et al., "MapReduce Online", NSDI 2010.
- A. Alexandrov et al., "The Stratosphere platform for Big Data Analytics", VLDB 2014.
- R. C. Fernandez et al., "Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management". In SIGMOD, 2013.
7. SQL-based Big Data Systems
- A. Floratou et al., "SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures", VLDB 2014.
- Kornacker et al., "Impala: A Modern, Open-Source SQL Engine for Hadoop", CIDR 2015.
- Huai et al., "Major technical advancements in Apache Hive", SIGMOD, 2014.
- Armbrust et al., "Spark SQL: Relational Data Processing in Spark", SIGMOD, 2015.
- L. Chang, "Presto: Interacting with petabytes of data at Facebook", blog post, 2013.
- W. Alkowaileet et al., "Large-scale Complex Analytics on Semi-structured Datasets using AsterixDB and Spark", VLDB 2016.
- A. Alexandrov et al., "Emma in Action: Declarative Dataflows for Scalable Data Analysis", SIGMOD 2016.
8. Systems for Big Graph Analytics
- Carlos H.C. Teixeira et al., "Arabesque: A system for distributed graph mining", SOSP 2015.
- Amitabha Roy et al., "Chaos: Scale-out Graph Processing from Secondary Storage", SOSP 2015.
- Aneesh Sharma et al., "GraphJet: Real-Time Content Recommendations at Twitter", VLDB 2016.
- D. Yan et al., "Big Graph Analytics Systems", SIGMOD 2016.
- D. Yan et al., "A General-Purpose Query-Centric Framework for Querying Big Graphs", VLDB 2016.
9. Approximate Query Processing
- S. Agarwal et al., "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", EuroSys, 2013.
- S. Agarwal et al., "Knowing when you're wrong: building fast and reliable approximate query processing systems," SIGMOD 2014.
- S. Agarwal et al., "Succinct: Enabling Queries on Compressed Data", NSDI, 2015.
- G. Ananthanarayanan, et al. "GRASS: trimming stragglers in approximation analytics," NSDI 2014.
10. Monitoring and Diagnosis in Data-Center-Scale Computing
- Mace et al., "Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems", SOSP, 2015.
- M. Moshref et al., "Trumpet: Timely and Precise Triggers in Data Centers", SIGCOMM 2016.
- R. Sambasivan et al., "Principled workflow-centric tracing of distributed systems," ACM SoCC 2016.
- M. Leich, "Runtime Analysis of Distributed Data Processing Programs," VLDB 2014.
- E. Coppa et al., "On Data Skewness, Stragglers and MapReduce Progress Indicators," ACM SoCC 2015.
11. Matrix Computations on Distributed Clusters
- R. B. Zadeh et al., "Matrix Computations and Optimization in Apache Spark", KDD 2016.
- A. Elgohary et al., "Compressed Linear Algebra for Large-scale Machine Learning", VLDB 2016.
- M. Li et al., "Cuckoo Linear Algebra", KDD 2015.
12. Management for Data-Center Networks
- A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network," SIGCOMM 2015.
- R. Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure", SIGCOMM 2016.
- Y. W. Sung et al., "Robotron: Top-down Network Management at Facebook Scale", SIGCOMM 2016.
- Chow, Michael, et al. "The Mystery Machine: End-to-end performance analysis of large-scale Internet services," OSDI 2014.
13. Traffic Flow Scheduling for Data Center Networks
- M. Chowdhury, I. Stoica, "Managing data transfers in computer clusters with Orchestra", SIGCOMM 2011.
- M. Chowdhury, I. Stoica, "Efficient coflow scheduling with Varys", SIGCOMM 2014.
- F.R. Dogar et al., "Decentralized Task-Aware Scheduling for Data Center Networks", SIGCOMM 2014.
- M. Chowdhury, I. Stoica, "Efficient Coflow Scheduling Without Prior Knowledge". SIGCOMM 2015.
- Y. Zhao et al., "RAPIER: Integrating Routing and Scheduling for Coflow-aware Data Center Networks", Infocom 2016.
- Y. Li et al., "Efficient online coflow routing and scheduling", ACM Mobihoc 2016.
- H. Zhang et al., "CODA: Toward Automatically Identifying and Scheduling COflows in the DArk", ACM SIGCOMM 2016.
- L. Chen et al., "Scheduling Mix-flows in Commodity Datacenters with Karuna", ACM SIGCOMM 2016.
- P. Wang et al., "Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks," ACM SoCC 2016.
14. Performance Prediction for Large-scale Analytics
- S. Venkataraman et al., "Ernest: Efficient Performance Prediction for Large-scale Advanced Analytics", NSDI 2016.
- K. Ousterhout, "Re-architecting Spark for Performance Understandability", Spark Summit 2016 talk.
- K. Ousterhout et al., "Making Sense of Performance in Data Analytics Frameworks", NSDI 2015.
- D. Crankshaw et al., "The missing piece in complex analytics: low latency, scalable model management and serving with Velox", CIDR 2015.
- N. J. Yadwadkar et al., "Wrangler: Predictable and Faster Jobs using Fewer Resources", ACM SoCC 2014.
- E. Sparks et al., "Automating model search for large-scale machine learning," ACM SoCC 2015.
- K. Rajan et al., "PerfOrator: eloquent performance models for Resource Optimization," ACM SoCC 2016.
- N. J. Yadwadkar et al., "Katz: Faster Jobs in Distributed Data Processing using Multi-Task Learning", SDM 2015.
15. VM/Cloud Resource Management/Scheduling
- W. Lang et al., "Not for the Timid: On the impact of Aggressive Over-booking in the Cloud", VLDB 2016.
- C. Fuerst et al., "Kraken: Online and Elastic Resource Reservations for Multi-tenant Datacenters", Infocom 2016.
- Z. Han et al., "Dynamic Virtual Machine Management via Approximate Markov Decision Process", Infocom 2016.
- J. Mace et al., "2DFQ: Two-dimensional Fair Queueing for Multi-Tenant Cloud Services", SIGCOMM 2016.
- J. Ghaderi, "Randomized Algorithms for Scheduling VMs in the Cloud", Infocom 2016.
16. Analyzing Dynamic/Time-evolving Graphs in Large-Computing Clusters
- I. Stoica, "Time-evolving Graph Processing on Commodity Clusters", Spark Summit 2016.
- A. P. Iyer et al. "Time-evolving graph processing at scale." Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems. ACM, 2016.
- Z. Y. Dong, "A Framework for Computing on Large Dynamic Graphs", arXiv preprint arXiv:1512.01668 (2015).
17. Accelerator/GPU Spark Integration
- Di Wu et al., "Deploying Accelerators at DataCenter Scale using Spark," Spark Summit 2016.
- Y. Hu et al., "GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale," Spark Summit 2016.
- Y.T. Chen et al., "Apache Spark Meets FPGAs: A case study for Next Generation DNA Sequencing Acceleration," HotCloud 2016.
- M. Huang et al., "Programming and Runtime Support to Blaze FPGA Accelerator Deployment at Datacenter Scale", ACM SoCC 2016.
18. Systems for In-memory Big Data Management and Processing
- H. Zhang et al., "In-memory big data management and processing: A survey", IEEE Transactions on Knowledge and Data Engineering 27.7 (2015): 1920-1948.
19. Machine Learning API/Toolkits for Large-scale Clusters
- Kraska et al., "MLbase: A Distributed Machine-learning System", CIDR, 2013.
- Sparks et al., "MLI: An API for Distributed Machine Learning", ICDM, 2013.
- M. Boehm et al., "SystemML: Declarative Machine Learning on Spark", VLDB 2016.
- Microsoft Distributed Machine Learning Toolkit (DMTK) http://www.dmtk.io/index.html.