DevOps conference recordings from Q4 2017: AWS re:Invent, LISA, PuppetConf and DevOpsDays

This post is my first attempt to index the recorded talks from DevOps-related events here on MeetupFeed. These 71 talks were recorded during Q4 2017 at events like AWS re:Invent, PuppetConf, LISA and DevOpsDays. The next DevOps post will be available in April. Subscribe: Twitter, newsletter

SF Bay Area OpenStack (Sunnyvale, CA)

Services as a Platform: Making DevOps Real at Scale

Randy Bias (Juniper Networks)

In an era when software is “eating the world,” developers rule. Public cloud services understand the economic implications of this trend, and no one more so than Amazon Web Services. The public cloud giant has transformed the way software developers interact with and rely on infrastructure and the services atop that infrastructure. This talk by early OpenStack pioneer Randy Bias will show how a nascent AWS-inspired movement is changing how an entire industry thinks about the way software developers and infrastructure managers do their jobs.

That movement—called “services as a platform” or SaaP—allows developers to use a catalog of composable, reusable components to assemble only the services they need to support each application. Essentially, SaaP lets developers build a customized platform-as-a-service (or PaaS) for each application they deploy. The implications for this new concept in abstraction are potentially as important as the seismic shift to cloud computing that began a decade ago. Randy will demonstrate that while the industry was arguing about whether AWS was building IaaS or PaaS, it was really building an entirely new concept, SaaP. The talk will offer an operational definition of SaaP, demonstrate how it is the logical descendant of IaaS and PaaS, articulate how it unlocks new forms of value for developers, operators, and their enterprises, and offer a roadmap for how everyone can use the concepts of SaaP in their own organizations as they move to agile and DevOps methodologies. We will also cover how OpenStack and other open source technologies can be used to develop your own SaaP. Then we will open up the floor for an “Ask Me Anything” session with Randy, hosted by Robert Starmer. Attend this talk if you are involved in helping your organization develop strategies and processes to succeed in a software-defined-everything world, or if you want to network with and hear cutting edge thoughts from pioneers, visionaries, and open source experts in our community. AND… do your homework ahead of time and read Randy’s thoughtful blog on this topic here.

Randy Bias

Randy is the Vice President of Technology and Strategy for Cloud Software at Juniper Networks. Randy was also a pioneer and early, vocal advocate for the OpenStack project, and has led teams that achieved numerous cloud firsts, including the first public cloud in Korea, the first global carrier NFV cloud, and the first “cattle cloud” for a Fortune 5 company (after popularizing the “pets vs. cattle” meme as a construct for describing the fundamental difference between how enterprise stacks and cloud stacks are managed). As a strategic R&D lead at Dell EMC, Randy also led the open sourcing of several products. Randy Bias is an entrepreneur, writer, speaker and futurist in cloud computing. Find Randy on Twitter @RandyBias

Robert Starmer

Robert is the founder and CTO of Kumulus Technologies, a San Francisco Bay Area infrastructure-focused DevOps, Systems Reliability Engineering, and Cloud Computing consultancy, where he writes, presents, develops, and educates industry customers on cloud, SRE and DevOps. Robert is also a Certified OpenStack Administrator (COA).

PuppetConf 2017 (San Francisco, CA)

PuppetConf 2017 Day 1 keynote and product announcements

Sanjay Mirchandani (Puppet)

We kicked off PuppetConf 2017 with our CEO Sanjay Mirchandani talking about the journey to pervasive automation — and announcing Puppet’s biggest and most important product innovations to date.

PuppetConf 2017 Day 1 keynote and product announcements

Omri Gazitt (Puppet)

See and hear all about Puppet’s biggest and most important product innovations to date — including Puppet Discovery, Puppet Tasks, Puppet Enterprise 2017.3, Puppet Pipelines — from Omri Gazitt, Sarrah Figueroa, Eric Sorenson and Rahul Singh.

PuppetConf 2017 Day 2 keynote

Tricia Burke (Diligent)

Words are Hard: Shifting How We Communicate at Diligent – Tricia Burke, VP of Production Operations at Diligent, shares how her team is scaling their automation success by partnering with developers to accelerate release cycle times.

PuppetConf 2017 Day 2 keynote

Michael Lopp (Slack)

In order to be a historic company, you need your culture to evolve. Learn how during Michael Lopp’s presentation.

PuppetConf 2017 Day 2 keynote

Thorsten Biel (Porsche)

Shifting into Higher Gear at Porsche – Hear from Thorsten Biel, Manager of Cloud and Integration Services, about how Porsche is leveraging Puppet Enterprise to support the company’s shift to a digital service provider.

Find more PuppetConf talks on their YouTube channel.

AWS re:Invent 2017 (Las Vegas, NV)

AWS re:Invent 2017 Keynote

Andy Jassy (AWS)

Andy Jassy, CEO of Amazon Web Services, delivers his AWS re:Invent 2017 keynote, featuring the latest news and announcements, including the launches of Amazon Elastic Container Service for Kubernetes (EKS), AWS Fargate, Aurora Multi-Master, Aurora Serverless, DynamoDB Global Tables, Amazon Neptune, S3 Select, Amazon SageMaker, AWS DeepLens, Amazon Rekognition Video, Amazon Kinesis Video Streams, Amazon Transcribe, Amazon Translate, Amazon Comprehend, AWS IoT 1-Click, AWS IoT Device Management, AWS IoT Device Defender, AWS IoT Analytics, Amazon FreeRTOS, and Greengrass ML Inference. Guest speakers include Dr. Matt Wood, of AWS; Roy Joseph, of Goldman Sachs; Mark Okerstrom, of Expedia; and Michelle McKenna-Doyle, of the NFL.

AWS re:Invent 2017 Keynote – Tuesday Night Live

Peter DeSantis (AWS)

Watch Peter DeSantis, VP, AWS Global Infrastructure, in the Tuesday Night Live keynote, featuring Brian Mathews, of Autodesk, and Greg Peters, of Netflix.

AWS re:Invent 2017 Keynote

Werner Vogels (AWS)

Watch Werner Vogels deliver his AWS re:Invent 2017 keynote, featuring the launch of Alexa for Business, AWS Cloud9, new AWS Lambda features, and Serverless App Repository.

Keep watching more re:Invent talks on the AWS YouTube channel with organized playlists.

LISA (San Francisco, CA)

Where’s the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom!

David Blank-Edelman

Let’s face it. We are great at building things: systems, services, infrastructures, you name it. But we are terrible, absolutely terrible, at decommissioning, demolishing, or destroying these same things in any sort of principled way. We spend so much time focused on how to construct systems that when it comes time to do the dance of destruction we are at a loss. We are even worse at building systems that will later be easy to destroy. But it doesn’t have to be this way. When they take down a bridge, a building, or even your bathroom before a renovation, things don’t just get ripped out willy-nilly (hopefully). There are methods, best practices, and lots and lots of careful work being brought to bear in these situations. There are people who demolish stuff for a living; let’s see what we can learn from them to take back to our own practice. Come to this talk not just for the explosions (and oh, yes, there will be explosions), but also to explore an important part of your work that never gets talked about: the kaboom.

Coherent Communications—What We Can Learn from Theoretical Physics

Kevin Barron (University of California)

In the tech world we typically focus almost exclusively on instrumental communication, because once we have nailed our communications objective in unambiguous, jargon-free language, we feel we can communicate precisely with our clientele and team members. And yet we fail, often spectacularly. Then we blame all the wrong things: the clients did not take enough interest, the team members were distracted or went off-message. On the other hand, we sometimes experience what seem to be spontaneous moments of clarity and free-flowing ideas, but rarely consider what enabled them. To better understand this dynamic, we need to step back and take the end-to-end view. In other words, use the same troubleshooting methods we would apply to a technical problem. Once we take a broader systemic view, we can remove the problems, and actively promote coherent communication.

Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful

Blake Bisset & Jonah Horowitz (Stripe)

People aren’t just wrong on the internet. Sometimes they bring it back to the office. We’re here to debunk the biggest traps we’ve stepped in, spent good drink money learning about from other people who’d stepped in them, or seen someone who hadn’t stepped in them yet propose as good practice. Save yourself some pain. Or just laugh at ours. The talk addresses specific anti-patterns we’ve seen in building teams and systems to manage service delivery for very large scale operations, and more appropriate ways to approach those issues.

The 7 Deadly Sins of Documentation

Chastity Blackwell (Yelp)

Documentation, or the lack of it, is often one of the biggest issues with working in tech. In most places, code has supremacy, and documentation ends up being an afterthought. Unfortunately, even in places where documentation is actually written, it’s often done quickly or poorly in the first place, not maintained, or not organized in a way that makes it easy to use. This talk will discuss the biggest problems surrounding creating, maintaining, and providing utility with documentation, and how to solve them.

Working with DBAs in a DevOps World

Silvia Botros (Sendgrid)

DevOps is about breaking silos. Bringing everyone to the table to bring more value to the company. But how does that fit with specialists on a team like DBAs who, by definition, are a silo of specific knowledge? Trick question! I don’t think DBAs are ‘by definition a silo’. I have been a DBA with outdated expectations of my roles in the past. And if you would like to know how to promote collaboration with your DBA team, I have some stories to share! In this talk, I will show you how to help your DBA get involved early in your feature planning, and how to draw on their expertise and use their knowledge to turn good performance and operationality into v1 features and not add-ons. I will draw from my experience as the only DBA in a rapidly growing company that was also learning how to DevOps as I was learning what the word means and doesn’t mean. I will give examples on how to grow the relationship between the DBA and the engineering teams to bring stronger collaboration. And I will share lessons learned from both projects that went well and some that had bumps in the road and why they did.

Case Study: Deploying a Multi-Region, Highly Available MySQL Architecture

Gabriel Ciciliani (Pythian)

A customer recently asked us to design a multi-region database architecture that allows their application to read by default from a local database instance while writing to a single master region. It would also need an automated way to handle failures on any of the regional database instances by redirecting both read and write traffic to an available region. In this session we are going to go through the architecture designed to fulfill the above requirements, what technologies were considered and why ProxySQL was chosen. We will also discuss advantages and limitations of the proposed architecture while sharing a few lessons learned in the process.
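To make the routing requirement concrete, here is a minimal Python sketch of the decision it describes: reads prefer the local region, writes go to the single master region, and either falls back to another healthy region on failure. This is not the talk's ProxySQL configuration; `Region`, `pick_region` and the region names are hypothetical, and a real write failover would also need to promote a new master, which the sketch glosses over.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Region:
    name: str
    healthy: bool = True


def pick_region(is_write: bool, local: Region, master: Region,
                fallbacks: Sequence[Region]) -> Region:
    """Choose a region for one query.

    Reads prefer the local region; writes go to the single master region.
    If the preferred region is down, use the first healthy fallback.
    """
    preferred = master if is_write else local
    if preferred.healthy:
        return preferred
    for region in fallbacks:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")
```

In this sketch the caller supplies the fallback order, so the same function can express "nearest region first" for reads and "designated standby first" for writes.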

Distributed Tracing: From Theory to Practice

Stella Cotton (Heroku)

Application performance monitoring is great for debugging inside a single app. However, as a system expands into multiple services, how can you understand the health of the system as a whole? Distributed tracing can help! You’ll learn the theory behind how distributed tracing works. But we’ll also dive into other practical considerations you won’t get from a README, like choosing libraries for your polyglot systems, infrastructure considerations, and security.
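As a rough illustration of the theory, the core idea is that every service records a span and passes a shared trace ID (plus its own span ID) to whatever it calls next. The sketch below is an assumption-laden toy, not any particular tracing library: the header names `x-trace-id` and `x-span-id`, the `Span` fields, and the in-memory `COLLECTED` list are all made up for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None


COLLECTED: List[Span] = []  # stand-in for a real tracing backend


def start_span(name: str, headers: Dict[str, str]) -> Span:
    """Continue the caller's trace if the headers carry one, else start a new trace."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    parent_id = headers.get("x-span-id")
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers to attach to downstream calls so they join the same trace."""
    return {"x-trace-id": span.trace_id, "x-span-id": span.span_id}


def finish(span: Span) -> None:
    """Record the end time and hand the span to the (fake) collector."""
    span.end = time.time()
    COLLECTED.append(span)
```

A frontend would call `start_span("web", {})`, pass `outgoing_headers(...)` on its request to a backend, and the backend's `start_span` would then produce a child span with the same `trace_id`.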

Stories from the Trenches of Government Technology

Matt Cutts & Raquel Romano (US Digital Service)

Since 2014, the US Digital Service has worked to improve government services for everyone from small businesses to veterans getting their benefits. Come hear some of the frustrating, surprising, and gratifying stories that we’ve seen as technologists trying to make government work better.

Fast Log Analysis Made Easy by Automatically Parsing Heterogeneous Logs

Biplob Debnath & Will Dennis (NEC Laboratories America, Inc.)

Existing log analysis tools like ELK (Elasticsearch-Logstash-Kibana), VMware LogInsight, Loggly, etc. provide platforms for indexing, monitoring, and visualizing logs. Although these tools allow users to relatively easily perform ad-hoc queries and define rules in order to generate alerts, they do not provide automated log parsing support. In particular, most of these systems use regular expressions (regex) to parse log messages. These tools assume that the administrators know how to work with regex, and make the admins manually parse and define the fields of interest. By definition, these tools support only supervised parsing as human input is essential. However, human involvement is clearly non-scalable for heterogeneous and continuously evolving log message formats in systems such as IoT, and it is humanly impossible to manually review the sheer number of log entries generated in an hour, let alone days and weeks. On top of that, writing regex-based parsing rules is time-consuming, frustrating, and error-prone, and regex rules may conflict with each other, especially for IoT-like systems. In this talk, we describe how we automatically generate regex rules based on the log data, which is described further in our research work, LogMine: Fast Pattern Recognition for Log Analytics, published at the CIKM 2016 conference. We also show a demo to illustrate how to integrate our solution with the popular ELK stack.
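To give a feel for what "generating regex rules from the log data" can mean, here is a deliberately naive Python sketch. It is not the LogMine algorithm from the paper; it is a much simpler stand-in that assumes the input lines already belong to one message type with the same number of whitespace-separated tokens. The function name `generalize` is hypothetical.

```python
import re
from typing import List


def generalize(lines: List[str]) -> str:
    """Build one regex from log lines that share the same token count.

    Tokens that are identical in every line are kept literally; positions
    where the lines differ become a non-whitespace capture group.
    """
    rows = [line.split() for line in lines]
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("expected non-empty input with equal token counts")
    parts = []
    for column in zip(*rows):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))  # constant token
        else:
            parts.append(r"(\S+)")  # variable field
    return r"^" + r"\s+".join(parts) + r"$"
```

Feeding it two login messages that differ only in user name and IP address yields a pattern with two capture groups, which could then be used to parse new lines of the same shape.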

Failure Happens: Improving Incident Response in Large-Scale Organizations

Damon Edwards (Rundeck, Inc.)

Deployment is a solved problem. Yes, there is still work to be done, but the operations community has successfully proven that we can both scale deployment automation and distribute the capability to execute deployments. Now, we have to turn our attention to the next critical constraint: What happens after deployment? We all know that failure is inevitable and is coming our way at any moment. How do we respond quickly and effectively to those failures? What works when there is just a small set of teams or an isolated system to manage will quickly break down when the organization grows in size and complexity. But on the other hand, what has been commonly practiced in large-scale enterprises is proving to be too cumbersome, too silo dependent, and simply too slow for today’s business needs. How do we rapidly respond to incidents and recover complex interdependent systems while working within an equally complex and interdependent organization? How does operations embrace the DevOps and Agile inspired demand for speed and self-service while maintaining quality and control? This talk examines the trial-and-error lessons learned by some forward-thinking enterprises who are currently streamlining how they:

- Resolve incidents
- Reduce friction between teams
- Divide up operational responsibilities
- Improve the quality of their ongoing operations

See how these companies are rethinking how and where operations happens by applying Lean and DevOps principles mixed with modern tooling practices.

Opening Plenaries: Security in Automation

Jamesha Fisher (GitHub)

Jamesha Fisher has worked in the Tech industry for over 10 years, with a keen eye towards security. Currently a Security Operations Engineer at GitHub, they have lent their security expertise throughout their career in Operations and Systems Engineering to other companies including Google and CloudPassage. In their spare time they are a maker of things musical, delicious, and objects that use binary numbers.

Queueing Theory in Practice: Performance Modeling for the Working Engineer

Eben Freeman (

Cloud! Autoscaling! Kubernetes! Etc! In theory, it’s easier than ever to scale a service based on variable demand. In practice, it’s still hard to take observed metrics, and translate them into quantitative predictions about what will happen to service performance as load changes. Resource limits are often chosen by guesstimation, and teams are likely to find themselves reacting to slowdowns and bottlenecks, rather than anticipating them. Queueing theory can help, by treating large-scale software systems as mathematical models that you can rigorously reason about. But it’s not necessarily easy to translate between real-world systems and textbook models. This talk will cover practical techniques for turning operational data into actionable predictions. We’ll show how to use the Universal Scalability Law to develop a model of system performance, and how to leverage that model to make more informed capacity planning and architectural decisions. We’ll discuss what data to gather in production to better inform its predictions — for example, why it’s important to capture the shape of a latency distribution, and not just a few percentiles. We’ll also talk about some of the limitations and pitfalls of performance modelling.
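For readers who have not met the Universal Scalability Law, it models throughput at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where σ captures contention and κ captures coherency (crosstalk) cost. The Python sketch below just evaluates that formula and its peak; the function names are mine, and in practice the three coefficients would be fitted to measured data rather than chosen by hand.

```python
import math


def usl_throughput(n: float, lam: float, sigma: float, kappa: float) -> float:
    """Universal Scalability Law: X(N) = lam*N / (1 + sigma*(N-1) + kappa*N*(N-1))."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma: float, kappa: float) -> float:
    """Concurrency at which predicted throughput peaks: sqrt((1 - sigma) / kappa)."""
    return math.sqrt((1 - sigma) / kappa)
```

With made-up coefficients such as `sigma=0.05` and `kappa=0.001`, the model predicts that throughput stops improving at roughly 31 concurrent workers, which is the kind of number that can inform a capacity plan.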

Clarifying Zero Trust: The Model, the Philosophy, the Ethos

Evan Gilman (Scytale) & Doug Barth (Stripe)

The world is changing, though our network security models have had difficulty keeping up. At a time when remote work is regular and cloud mobility is paramount, the perimeter security model is showing its age, badly. We deal with VPN tunnel overhead and management. We spend millions on fault-tolerant perimeter firewalls. We carefully manage all entry and exit points on the network, yet still we see ever-worsening breaches year over year. The Zero Trust model aims to solve these problems. Zero Trust networks are built with security at the forefront. No packet is trusted without cryptographic signatures. Policy is constructed using software and user identity rather than IP addresses. Physical location and network topology no longer matter. The Zero Trust model is unique indeed. In this talk, we’ll discuss the philosophy and origin of the Zero Trust model, why it’s needed, and what it brings to the table.
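Two of the ideas above, "nothing is trusted without a cryptographic signature" and "policy is keyed on identity rather than IP address", can be sketched in a few lines of Python. This is only a toy under loud assumptions: the `POLICY` table, the identities, and the single shared HMAC key are hypothetical, and a real Zero Trust deployment would more likely use per-identity certificates and mutual TLS.

```python
import hashlib
import hmac

# Hypothetical policy: which service identity may perform which action.
POLICY = {
    ("billing-service", "read:invoices"),
    ("web-frontend", "read:profile"),
}


def sign(key: bytes, identity: str, action: str) -> str:
    """Sign an (identity, action) pair with HMAC-SHA256."""
    msg = f"{identity}|{action}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def authorize(key: bytes, identity: str, action: str, signature: str) -> bool:
    """Allow a request only if its signature verifies and policy permits it.

    The caller's network address is deliberately not an input.
    """
    expected = sign(key, identity, action)
    if not hmac.compare_digest(expected, signature):
        return False
    return (identity, action) in POLICY
```

Note what is absent: there is no source IP or subnet anywhere in the decision, which is the point of the model.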

Fast and Safe Production Monitoring of JVM Applications with BPF Magic

Sasha Goldshtein (Sela Group)

All of us have seen these evasive performance issues or production bugs in the field, which standard monitoring tools don’t see or catch. BPF is a Linux kernel technology that enables fast, safe, dynamic tracing of a running system without any preparation or instrumentation in advance. The JVM itself has a myriad of insertion points for tracing garbage collections, object allocations, JNI calls, and even method calls with extended probes. When the JVM tracepoints don’t cut it, the Linux kernel and libraries allow tracing system calls, network packets, scheduler events, off-CPU time, time blocked on disk accesses, and even database queries. In this talk, we will see a holistic set of BPF-based tools for monitoring JVM applications on Linux, and revisit a systems performance checklist that includes classics like fileslower, opensnoop, and strace—all based on the non-invasive, fast, and safe BPF technology.

Linux Container Performance Analysis

Brendan Gregg (Netflix)

Containers pose interesting challenges for performance monitoring and analysis, requiring new analysis methodologies and tooling. Resource-oriented analysis, as is common with systems performance tools and GUIs, must now account for both hardware limits and soft limits, as implemented using cgroups. A reverse diagnosis methodology can be applied to identify whether a container is resource constrained, and by which hard or soft resource. The interaction between the host and containers can also be examined, and noisy neighbors identified or exonerated. Performance tooling can need special usage or workarounds to function properly from within a container or on the host, to deal with different privilege levels and namespaces. At Netflix, we’re using containers for some microservices, and care very much about analyzing and tuning our containers to be as fast and efficient as possible. This talk will show you how to identify bottlenecks in the host or container configuration, in the applications by profiling in a container environment, and how to dig deeper into kernel and container internals.

SREBot—More Than a Chatbot—An Intelligent Bot to Crush Mitigation Time

Cezar Alevatto Guimaraes (Microsoft)

SREBot is a knowledgeable and intelligent engine that replaces tribal knowledge and automates incident management activities. It is also extensible, allowing other teams to add their own knowledge. In this talk you will hear how SREBot is being developed and used to reduce the Time to Mitigate (TTM) for Microsoft incidents. We will explain how it was designed and then share the main issues we are facing.

Sample Your Traffic but Keep the Good Stuff!

Ben Hartshorne (

The two main methods of reducing high volume instrumentation data to a manageable load are aggregation and sampling. Aggregation is well understood, but sampling remains a mystery. We’ll start by laying down the basic ground rules for sampling—what it means and how to implement the simplest methods. There are many ways to think about sampling, but with a good starting point, you gain immense flexibility. Once we have the basics of what it means to sample, we’ll look at some different traffic patterns and the effect of sampling on each. When do you lose visibility into your service with simple sampling methods? What can you do about it? Given the patterns of traffic in a modern web infrastructure, there are some solid methods to change how you think about sampling in a way that lets you keep visibility into the most important parts of your infrastructure while maintaining the benefits of transmitting only a portion of your volume to your instrumentation service. Taking it a step further, you can push these sampling methods beyond their expected boundaries by using feedback from your service and its volume to affect your sampling rates! Your application knows best how the traffic flowing through it varies; allowing it to decide how to sample the instrumentation can give you the ability to reduce total throughput by an order of magnitude while still maintaining the necessary visibility into the parts of the system that matter most. I’ll finish by bringing up some examples of dynamic sampling in our own infrastructure and talk about how it lets us see individual events of interest while keeping only 1/1000th of the overall traffic.
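As a small illustration of sampling that "keeps the good stuff", the Python sketch below keeps 1 in N events, where N depends on how interesting the event is, and stamps each kept event with its rate so totals can be estimated afterwards. It is a sketch under assumptions, not the speaker's implementation: the `status` key, the rates in `SAMPLE_RATES`, and the injectable `rng` parameter are all hypothetical.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical per-key rates: keep 1 in N events for each kind of traffic.
SAMPLE_RATES: Dict[str, int] = {"200": 1000, "500": 1}
DEFAULT_RATE = 100


def sample(event: Dict[str, str],
           rng: Callable[[], float] = random.random) -> Optional[Dict[str, object]]:
    """Keep an event with probability 1/rate, where rate depends on its status.

    Kept events carry their sample rate so totals can be re-estimated later.
    """
    rate = SAMPLE_RATES.get(event["status"], DEFAULT_RATE)
    if rng() < 1.0 / rate:
        return {**event, "sample_rate": rate}
    return None
```

Summing `sample_rate` over the kept events gives an estimate of the original event count, while every error (rate 1) is still visible individually. A dynamic variant would adjust `SAMPLE_RATES` at runtime based on observed traffic.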

Wait for Us! Evolving On-Call as Your Company Grows

Christopher Hoey (Datadog)

The talk will start with a quick overview of the rapid growth Datadog experienced and the resulting challenges. This is done to illustrate the eventual challenges where a simple primary and secondary on-call team starts to fall apart. In hindsight the signs are obvious; however, in the thick of it all it is hard to step back and realize the on-call team and processes were falling apart. It should be said that what was in place worked and met its needs for a long time. You have to start somewhere. The evolution is what I focus on while sharing the tricks to make that evolution easier. The talk will then go into some of the patterns Datadog found useful such as refining our incident management processes and roles, growing the depth of the on-call team, eventually switching to per-team rotations and the challenges involved through this evolution. We will highlight some of the useful tricks and tools Datadog has used such as:

- Structured service templates to help with on-call training
- On-call training and shadow ops rotations
- The use of GitHub Issues to track on-call tasks for handoff and to use as training examples
- Scheduled on-call handoffs that include systematically reviewing the sources of alerts to kill noise
- Providing a way to capture monitor feedback from every alert notification
- Patterns of using GitHub projects to track where each on-call member stands as far as service training
- Scripts to use in conjunction with the service templates and on-call scheduling to show each on-call member a list of what changed since the last time they were on-call

An Internet of Governments: How Policymakers Became Interested in “Cyber”

Maarten Van Horenbeeck (Fastly, Inc.)

Gradually, the internet has become a bigger part of how we socialize, do business, and lead our daily lives. Though they typically do not own much of the infrastructure, governments have taken ever-increasing note, often with aspiration, and sometimes with suspicion. In this talk, we’ll cover how governments internationally debate and work on topics of cybersecurity, agree on what the challenges are, and get inspiration for solutions. The talk will show how these concerns often originate from domestic concerns, but then enter several processes in which governments meet, debate, agree, and disagree on their solutions. You’ll learn about initiatives such as the ITU, the UNGGE, the Global Conference on Cyberspace, and the Internet Governance Forum, and how you as an engineer can contribute!

Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC

Michael Jennings (Los Alamos National Laboratory)

Supercomputing centers are seeing increasing demand for user-defined software stacks (UDSS), instead of or in addition to the stack provided by the center. These UDSS support user needs such as complex dependencies or build requirements, externally required configurations, portability, and consistency. The challenge for centers is to provide these services in a usable manner while minimizing the risks: security, support burden, missing functionality, and performance. We present Charliecloud, which uses the Linux user and mount namespaces to run industry-standard Docker containers with no privileged operations or daemons on center resources. Our simple approach avoids most security risks while maintaining access to the performance and functionality already on offer, doing so in just 900 lines of code. Charliecloud promises to bring an industry-standard UDSS user workflow to existing, minimally altered HPC resources.

Pintrace: A Distributed Tracing Pipeline

Suman Karumuri (Pinterest)

Speed improves customer engagement. With the emergence of microservices, it is very common for a single customer interaction, such as loading the home page or querying a search end point, to invoke hundreds of calls to dozens of back-end services. In this multi-tenant environment, traditional monitoring and profiling tools can’t tell us why a specific request was slow. Distributed tracing is the only tool available today that lets us trace a request across several systems. Using the gathered traces, we can correctly debug how a specific request is processed across the service, understand where an application spent most of its time and gain insight into why a particular request was slow. In this talk, I will present PinTrace, our Zipkin-based distributed tracing infrastructure. I will also talk about the challenges of instrumenting and deploying the tracing in a polyglot microservices architecture at scale. I will also share a few examples of how we use traces from production to debug p99 latency issues, identify unnecessary network calls and performance bottlenecks in the system. I will conclude the talk with a few use cases of distributed tracing beyond performance optimization like architectural visualization.

System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields

Jon Kuroda (University of California, Berkeley)

Commercial aviation, civil and structural engineering, emergency medicine, and the nuclear power industry all have hard-earned lessons gained over their respective histories, histories that stretch back decades or even centuries. Often acquired at a bloody cost, these experiences led to the development of environments typified by stringent regulation, strict test and design protocols, and demanding training and education requirements—all driven by a need to minimize loss of life. In stark contrast, the computer industry in general and systems administration specifically have developed in a relatively unrestricted environment, largely free, outside of a few niche fields, from the regulation and external control seen in life-safety critical fields. However, despite these major differences, these far more demanding environments still have many lessons to offer systems administrators and systems designers and engineers to apply to the design, development, and operation of computing systems. We will look at incidents ranging from Air France 447 to Three Mile Island and what we can learn from the experiences of those involved both in the incidents and the subsequent investigations. We will draw parallels between our field as a whole and these other less forgiving fields in areas such as Education and Training, Monitoring, Design and Testing, Human Computer/Systems Interaction, Human Performance Factors, Organizational Culture, and Team Building. We hope that you will take away not just a list of object lessons but also a new perspective and lens through which to view the work you do and the environment in which you do it.

DevOps in Regulatory Spaces: It’s Only 25% What You Thought It Was…

Peter Lega (Merck and Company)

You’ve embraced the DevOps concepts, found your sympaticos, established a solid technical ecosystem and culture, and even delivered some great early results with a first follower portfolio. Now, you have entered the mission-critical regulatory problem space at scale. The traditional DevOps “goodness” and culture have taken you this far. Now it’s time to scale, with a whole new set of regulatory and compliance constituents and technical maturity needs. In this talk, we will share the “first contact” experience with the long established regulatory community as we embarked on delivering larger complex solutions and the challenges and compelling opportunities to transform that have unfolded to enable compliance as code from portfolio through production.

Testing Before You Scale & Making Friends While You Do It

Renee Lung (PagerDuty)

Your customers shouldn’t find problems before you do. When we develop software and make architectural decisions, we try to anticipate potential problems—ambiguous user interfaces, performance bottlenecks, and other edge cases. Generally we do a good job of it, but as system complexity grows, the mental models we use to plan and understand those structures don’t always adequately accommodate those complexities. So what do we do about this? We can test all the things! By using automation, we test complex scaling scenarios to validate our mental models and to identify unanticipated side-effects. One of the issues we recently dealt with was supporting a major change in our traffic patterns. Although overall load stayed the same, the stress points produced by that load changed significantly. Major shifts like these always have the potential to disrupt our service, and in turn, disrupt our customers’ ability to keep their systems running. We had some predictions about how our system would react to the new load profile, but we wanted to validate those predictions ourselves rather than waiting for our customers to experience service degradation. Although each engineering team had some idea of how these changes would affect the performance of their own services and had work scheduled to address those issues, I wanted to make sure we were all equipped to make informed prioritization and planning decisions. All I had to do was figure out a way to consolidate the efforts of more than 90 engineers into one focused attack on our scaling challenges. Fortunately, I didn’t have to start from scratch: I could build on existing attitudes of collaboration, ownership, and a culture of reliability which has resulted in a rich toolset for testing resilience and scalability. This talk will outline how we used those tools, developed new ones, what we learned in the process, and the challenges of consolidating the efforts of separate teams towards a specific, common initiative.

UX Design and Education for Effective Monitoring Tools

Amy Nguyen (Stripe)

The fastest way to become a 10X engineer is by enabling 10 other engineers to do their jobs better. As infrastructure engineers, part of our mission is to empower the rest of our engineering organization to use the tools we develop correctly, quickly, and independently. Yet we often fall short of that mission in unexpected ways. In this talk, I will explain ways to make concepts like interpolation, aggregation, and alerting more intuitive and how to identify pain points for new users. I’ll go over common misconceptions users have about monitoring and how you can clear up this confusion with improved training and UI design.

ChatOps at Shopify: Inviting Bots in Our Day-to-Day Operations

Daniella Niyonkuru (Shopify)

ChatOps has already been identified as instrumental for DevOps success. In this talk, I will describe how we use chatbots to accelerate developer onboarding, increase developer productivity, and manage service disruption incidents. ChatOps is about bringing tools into your conversations and using them to interact with the infrastructure. It traditionally combines a chatbot, key plugins, and scripts. I will describe how we integrate these to perform actions related to the infrastructure, such as rebalancing traffic, querying the infrastructure state, and various other actions.

Your Secrets in Cloud-Based Key Management Services

Dan O’Boyle (Stack Overflow)

Do you encrypt secrets before committing them to a repository? Are API keys and passwords stored in a local library any team member can decrypt? Are you forced to re-encrypt all secrets any time access changes? Stop doing those things! Cloud-based Key Management Services (Google KMS, Azure Key Vault, Amazon KMS) provide encryption keys as a service. A KMS creates a centralized access control list. Using a KMS, you can centralize secrets, removing them from local libraries. Key rotation can be automated, often making a KMS more secure than local key management practices.

Vax to K8s: Ticketmaster’s Transformation to Cloud Native Devops

Heather Osborn (Ticketmaster)

When you have a 40-year-old company deeply rooted in legacy technologies, the work required to reinvent it is dramatic. I will share how we’ve handled this journey so far, our successes and failures, and where we’re going in the future. Moving from a siloed on-premises environment to a DevOps cloud-native company has not been without discomfort. The time-to-market improvements and increased visibility of problems have evolved us into a more agile company that has the potential to keep pace with startups.

Resiliency Testing with Toxiproxy

Jake Pittis (Shopify)

Fibers get cut, databases crash, and you’ve adopted Chaos Engineering to challenge your production environment as much as possible. But what are you doing to craft resiliency test suites that minimize the impact on your application as much as possible? How do you debug resiliency problems locally, and make sure to architect for robustness at development time? Toxiproxy is an open-source tool we’ve used for the past 2 years to emulate timeouts, latency, and outages, and one we believe could benefit nearly every company faced with these issues. In this talk, we’ll dive into practical tips, lessons learned, and best practices so you can use Toxiproxy to write resilient test suites.

Capacity and Stability Patterns

Brian Pitts (Eventbrite)

At Eventbrite, engineers are tasked with building systems that can withstand dramatic spikes in load when popular events go on sale. There are patterns that help us do this:

* Bulkheads: partitioning systems to prevent cascading failures
* Canary testing: slowly rolling out new code
* Graceful degradation: turning functionality on and off in response to failures or load
* Rate limiting: controlling the amount of work you accept
* Timeouts: limiting the time you wait for a request you made to complete
* Load shedding: purposefully not handling some requests in order to reserve resources for others
* Caching: saving and re-serving results to lessen expensive requests
* Planning: getting the resources you need in place before you need them

In this talk you will learn about each of those patterns, how Eventbrite has adopted them, and how to implement them within your own code and infrastructure.
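
To make the rate limiting and load shedding patterns above concrete, here is a minimal sketch of a token bucket in Python. It is an illustration only and is not taken from the talk or from Eventbrite's code; the `TokenBucket` class, its parameters, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter: allows bursts up to `capacity` requests
    and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller rejects the request (load shedding)
```

A caller would check `bucket.allow()` before doing expensive work and reject the request immediately when it returns `False`, which keeps resources free for the requests that are accepted.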

Never Events

Matt Provost (Yelp)

The NHS is the United Kingdom’s National Health Service, established in 1948 to provide free healthcare at point of service to all 64.6 million UK residents. In England’s National Health Service (NHS), a Never Event is a serious incident that “arise[s] from [the] failure of strong systemic protective barriers which can be defined as successful, reliable and comprehensive safeguards or remedies”. The key criteria for defining Never Events are that they are preventable and have the potential to cause serious patient harm or death. All Never Events are reportable and undergo Root Cause Analysis to determine why the failure occurred, to prevent similar incidents from happening again.

Considering that the NHS is a healthcare service where incidents can obviously have serious, life-threatening or life-changing consequences, together with the scale of services provided (the NHS in England deals with over 1 million patients every 36 hours), their list of Never Events is actually quite short (14 events), including such items as “Wrong site surgery”, “Retained foreign object post-procedure”, and “Wrong route administration of medication.” In our industry, the requirement for these events to be preventable would exclude things like DDoS attacks or security breaches which are outside of the SRE team’s direct control. Of course steps should be taken to minimise or prevent these types of incidents, the same way that doctors work to prevent patients from dying of cancer. But they don’t cause cancer, so a patient dying of it is not a Never Event. However, a nurse administering the wrong type of cancer medication, or cancer medication to the wrong patient, or delivering the medication via the wrong route (intravenous vs spinal etc) can all be Never Events. If there are insufficient processes in place to prevent such mistakes, then they cannot be Never Events. This system is designed to protect the staff as well as patients, so that they aren’t put under pressure to be perfect. There must be procedures in place so that it doesn’t come down to an individual to make all of the correct choices on their own. Never Events are a fundamental part of the safety culture of the NHS which is a “just culture that rejects blame as a tool.” In recent years, modern systems safety concepts such as just culture and blameless postmortems have been introduced to the System Administration/Site Reliability Engineering/DevOps community from other fields (such as healthcare).
However, the concept of defining specific Never Events has not been explored in this context and can bring similar benefits to those reported by the healthcare community with a reduction in the recurrence of such events. Many systems engineering organisations already have their own formal or informal guidelines for reportable events. Publishing postmortems (either internally or public facing) is now becoming standard practice in our industry, but not all of these events are Never Events. These incidents should be studied by each organisation after each postmortem to generate a list of failures that should never occur again because safety systems/protective barriers have been put in place to prevent them. Any occurrence of such an incident after the fact is therefore a Never Event. The goal of implementing the Never Events system is firstly to reduce the number of these serious events, but also to protect staff and to provide a safe working environment. Repeated Never Events indicate that management has not addressed the underlying causes of these incidents, which shifts responsibility away from the front line staff who are operating in (clearly) unsafe conditions or with inadequate safety systems in place to prevent these events. While each organisation will come up with its own list of Never Events for their specific environment based on their examination and analysis of previous incidents, some generalisations can be made. For example, consider “Wrong Site Surgery” from the NHS list, where the wrong part of the body is operated on (left vs right leg etc). This is a process failure, where the staff may do the correct procedure but to the wrong location. Transferring that to the systems administration world, this is analogous to running the correct command on the wrong system.
During their careers, most (if not all) system administrators have made certain classes of similar mistakes such as rebooting the wrong server, removing the wrong directory (including the classic “rm -rf /”) or executing a SQL DELETE statement without a WHERE clause. We will examine the steps the NHS has taken to prevent this type of “wrong site” incident, along with other Never Events. By learning from other industries we can come up with recommendations for preventing similar mistakes in our field.
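
One possible protective barrier against the "correct command on the wrong system" class of mistake is to make the operator name the intended host and refuse to proceed on a mismatch. The sketch below is only an illustration of that idea; it is not described in the talk abstract and is not an NHS or Yelp procedure. The names `require_host`, `guarded`, and `WrongHostError` are hypothetical.

```python
import socket


class WrongHostError(RuntimeError):
    """Raised when a command is about to run on an unintended host."""


def require_host(expected, actual=None):
    """Abort unless the current host matches the one the operator named.

    `actual` defaults to this machine's hostname; it is a parameter only
    so the check can be exercised without changing hosts.
    """
    actual = socket.gethostname() if actual is None else actual
    if actual != expected:
        raise WrongHostError(
            f"refusing to run: on {actual!r}, expected {expected!r}"
        )


def guarded(expected, action, actual=None):
    """Run `action` only after the host check passes."""
    require_host(expected, actual)
    return action()
```

The design choice here is that the safeguard lives in the tooling rather than in the operator's memory, so a slip in attention results in a refusal instead of a reboot of the wrong server.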

“Don’t You Know Who I Am?!” The Danger of Celebrity in Tech

Corey Quinn (Last Week in AWS)

Thought Leaders. DevOps Heroes. Public Speakers. We listen to them as they talk about their solutions, their approaches, and their inevitable triumphs. But are we starting down a dark path, as we forget that ‘what makes a great talk’ and ‘what makes sense for your environment’ may not be the same thing? In this entertaining and slightly irreverent talk, the speaker discusses the dangers of taking others’ experiences as a source of absolute truth. A discussion of how the innovative and clever solutions that headline various conference talks may very well not apply to your environment will ensue, including but not limited to:

* Matching business requirements to what technologies can deliver
* The trap of feeling like you’re falling behind if you’re not doing what the “bleeding edge” companies are
* At least one story of how following this approach went hilariously wrong

Scaling Talent: Attracting and Retaining a Diverse Workforce


Moderator: Tameika Reed, Founder of WomenInLinux

Panelists: Derek Arnold; Amy Nguyen, Stripe; Qiana Patterson, QP Advisors; Wayne Sutton, Co-Founder, CTO, Change Catalyst

Derek Arnold has worked in many different parts of technology across multiple sectors as a system administrator, developer and instructor in the telecommunications, manufacturing, education and government sectors for the last 20 years. Amy Nguyen is a software engineer passionate about making data understandable for everyone. In the past, she studied computer science and philosophy at Stanford University, served on the board of Stanford Women in Computer Science for three years, and helped in making computer science the most popular major for female undergraduates during her time there. Outside of work, Amy writes about the tech industry, loves baking, and reads too many self-improvement books. Qiana Patterson is a seasoned tech executive, specializing in K12 education, higher education and workforce development. She brings over 10 years of experience in the education sector and a wealth of leadership and project/product management expertise in the technology industry. She was the founding COO for Edlio, an LA-based K12 edtech company; prior to that, she served as the Interim CEO of Educational Networks, a leading content management software platform company. While at Educational Networks, she served as a lead manager in almost all areas and teams of the company. Before Educational Networks, Qiana worked as a teacher and Dean of Students in the Los Angeles Unified School District. She currently leads her own tech consulting firm, QP Advisors, where she helps startups to mature companies develop products customers love. Wayne Sutton is a serial entrepreneur and co-founder of Change Catalyst and its Tech Inclusion programs. Change Catalyst is dedicated to exploring innovative solutions to diversity and inclusion in tech through the Tech Inclusion Conference, training, workshops, and the Change Catalyst Startup Fellows Program. Sutton’s experience includes years of establishing partnerships with large brands to early stage startups.
As a leading voice in diversity and inclusion in tech, Sutton shares his thoughts on solutions and culture in various media outlets where he has been featured in TechCrunch, USA Today, and the Wall Street Journal. In addition to mentoring and advising early stage startups, Sutton’s life goal is to educate entrepreneurs who are passionate about using technology to change the world. Wayne is a 2017 New America CA Fellow.

Have You Tried Turning It Off and Turning It On Again?

Tanya Reilly (Google)

Most of us have a backup strategy and many of us have a restore strategy and several of us have a fully tested restore strategy. But backups are far from the whole story! This talk covers the parts of disaster recovery you might be less prepared for, and the dependencies that you might not think about until one day when you really do turn an entire service, entire site or (perish the thought!) an entire company off and on again. We’ll look at why the best laid fallback plans tend to go wrong, and why you should start deliberately managing your dependencies long before you think you need to. And we’ll look at dependency cycles that make it difficult or impossible to restart groups of systems. Like, where do you store the documentation on how to recover the documentation server?

Disaggregating the Network: Switching as a Service

Nina Schiff (Facebook)

At Facebook, we’ve traditionally focused on disaggregation through most of our systems. This has helped us to iterate faster, harden where needed and scale out our bottlenecks more easily. However, in the network, we have had very little control over the switching ecosystem, making us reliant on the timelines of other companies. Adaptability and customization are not typically what comes to mind when people think about network switches. Hardware is often proprietary, and if you’re buying a vendor switch, you don’t control the frequency or speed of new features or bug fixes. These constraints are inconvenient at best, particularly for large production environments. This led us to try something different – disaggregating the hardware components and software workflow, into Wedge and FBOSS respectively. We also moved to make our switches look significantly more like traditional servers. While this has brought new (and definitely interesting) challenges, it has also meant that we’ve been able to piggyback off advances in server management. This talk takes a look at this composite architecture within our production setting while examining the lessons we learnt along the way. It also highlights how having a server as a switch helps us iterate faster, provides a more reliable network and meets the scaling demands of Facebook’s ever-increasing traffic growth.

Scalability Is Quantifiable: The Universal Scalability Law

Baron Schwartz (VividCortex)

Do you know what scalability really is? It’s a mathematical function that’s simple, precise, and useful. REALLY useful. It describes the relationship between system performance and load. In this talk you’ll learn the function (the Universal Scalability Law), how it describes and predicts system behavior you see every day, and how to use it in practice. I’ll show you how to understand the function, how to capture the data you need to measure your own system’s behavior (you probably already have that), and how to analyze the data with the USL. You’ll leave this talk knowing exactly what scalability is and what causes non-linear scaling. There are two factors, and you’ll start seeing those everywhere, too. As a result, when systems don’t scale you’ll know what kind of problem to look for, and you’ll avoid building bottlenecks into your systems in the first place. Final note: this talk requires zero mathematical skill.
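
For readers who want to see the function itself, the sketch below uses the commonly published form of the Universal Scalability Law, C(N) = N / (1 + α(N − 1) + βN(N − 1)), where N is the load or concurrency, α is usually called the contention coefficient and β the coherency (crosstalk) coefficient. The abstract does not spell out the formula or name the two factors, so this notation, the function names, and the coefficient values are assumptions for illustration rather than material from the talk.

```python
import math


def usl_capacity(n, alpha, beta):
    """Relative capacity at load n, normalised so that capacity at n=1 is 1.

    alpha: contention coefficient (cost of queueing for shared resources)
    beta:  coherency coefficient (cost of keeping workers consistent)
    """
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha, beta):
    """Load at which capacity is maximal; beyond it, adding load reduces it."""
    return math.sqrt((1.0 - alpha) / beta)


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # illustrative values, not measured data
    for n in (1, 8, 16, 32, 64, 128):
        print(n, round(usl_capacity(n, alpha, beta), 2))
    print("peak near N =", round(usl_peak(alpha, beta), 1))
```

With β = 0 the curve flattens toward a ceiling of 1/α; with β > 0 it peaks and then declines, which is the non-linear behaviour the abstract refers to.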

Now You See Me Too: Visual Tooling for Advanced System Analysis

Suchakrapani Sharma (ShiftLeft Inc.)

Command line tools offer the lowest friction and the lowest bar to entry for system analysis. However, visual analysis yields more information in a shorter amount of time. As an example, when an application crashes or an elusive transient bug occurs, understanding the call stack that led to the anomaly is valuable information. Recording such function call graphs of the application and displaying them on the command line as huge chunks of text has been a common occurrence and a quick resort for such analyses. However, methodical analysis requires better visuals. Modern representations such as FlameGraphs, FlameCharts, and Sunbursts have shown how effective the same analysis can be when represented visually. However, there are hundreds of techniques to gather trace/debug data, and understanding which visual tool best represents which data can be a daunting task. This talk focuses on the various visual tools available for common system analysis and debugging scenarios. We explore some open source tools used in system tracing and the representation formats for such data coming from multiple sources such as LTTng and eBPF. We explore historical origins of such visual representations and see the evolution of research ideas to concrete modern tools. We also discuss how in a few minutes you can easily enhance the same tools and develop new views to visualize a wide range of data, from network capture and container/VM tracing to hardware traces coming directly from CPUs, all in the same tool.

Managing SSH Access without Managing SSH Keys

Niall Sheridan (Intercom)

Everyone uses SSH to manage their production infrastructure, but it’s really difficult to do a good job of managing SSH keys. Many organisations don’t know how many SSH keys have access to production systems or how protected those keys are. A trusted SSH private key can be years old, unprotected by passphrase, and shared among multiple people who may not even work for you. With some tooling and configuration, SSH keys can be replaced with limited-use ephemeral certificates, issued centrally and with better access controls and automatic key expiration, solving many of the shortcomings of using SSH keys. This talk will cover:

* Managing SSH keys: The bad parts
* Replacing SSH keys with ephemeral certificates: how & why
* Discussion of an implementation of a CA for SSH certificates
* Call for participation, showing GitHub source

Calcifying Crisis Readiness

Rock Stevens (University of Maryland)

No organization is immune to data breaches, insider threats, and other cyber attacks. An unprepared organization can exacerbate the impact of these threats, leading to a loss of consumer trust and confidence. In this talk, I propose a radically new training method for preparedness that fuses together concepts like the Netflix “Chaos Monkey” with U.S. military “react to contact” drills. Your organization, from technicians to C-level executives, can immediately adopt this proposal to mitigate future threats and lessen the effect of successful attacks.

Debugging at Scale Using Elastic and Machine Learning

Mohit Suley (Microsoft)

Engineers are well attuned to debugging issues on a single machine. However, when the architecture scales out to possibly hundreds or thousands of machines with components 10+ layers deep, debugging doesn’t look the same anymore. The concept of looking at logs becomes ‘collective’ in nature and looking for patterns in logs is the only viable way of associating them with the problems you are trying to solve. We will walk through the motivation for building such a system and how it differs from traditional monitoring and debugging. A system designed this way collects all needed artifacts, identifies known/unknown patterns in error messages, correlates them with the infrastructure serving these errors, and allows outlier service components to be exposed within 10-15 minutes of a developing problem trend.

LinkedIn’s Distributed Firewall

Mike Svoboda (LinkedIn)

Distributed Firewall (DFW) has fundamentally altered LinkedIn’s System, Network, and Security Operations. This technology has enabled LinkedIn to expand with unbounded horizontal scalability by leveraging Software Defined Networking. Combining system automation with host-based firewalls, DFW has not only allowed LinkedIn to alter the physical network design, but it has also increased the security protections that we can now provide in Production environments. In this presentation, we will share how LinkedIn was able to remove physical and logical network firewall bottlenecks. By shifting network security enforcement down to the per-host level, DFW enables LinkedIn to fully utilize datacenter power, cooling, and space facilities by intermixing heterogeneous environments within the same physical rack and network footprint. Integrating DFW with LinkedIn’s code deployment system, the firewall has become aware of the specific application requirements on each node, and can build a unique security profile to secure the hosted services. We will demonstrate DFW in action, point to the open source code, and will share lessons learned from our Production implementation so other organizations can leverage this technology.

Operational Compliance: From Requirements to Reality

Trevor Vaughan (Onyx Point, Inc.)

A mere mention of compliance is one of those things that makes most teams throw up their hands in frustration. We would like to share how our Government and Industry customers have successfully approached Compliance Driven Operations and how to use standard development and engineering methodologies to address compliance concerns in a practical manner. Specific techniques and technologies will be mentioned that can help teams approach Compliance as ‘just another set of requirements’ and understand how to communicate effectively with security teams and auditors.

The Actor Model and the Queue or “Batch is the New Black”

James Whitehead II (Formularity)

This presentation will explain how two simple, decades-old computer paradigms can be combined and used to build the world’s largest and most resilient computing solutions. Real world examples will be shown.

In 1974, Carl Hewitt published his paper on the Actor model. In computing, an Actor is a computer program that uses information fed to it through messages to 1) create new Actors, 2) send messages to other Actors, and 3) make limited, often binary, decisions. Just as the binary on-off state of a single transistor can be built into the 2.6 billion(!) transistor Intel i7 microprocessor, Actors can be built into the most complex processing systems. If the Actor model sounds familiar, it’s because it is the basis for Microservices, one of the hottest new topics in cloud computing. Just another example that “…what has been will be again, what has been done will be done again; there is nothing new under the sun.” The Actor Model is only half of the solution. The key to using Actors to build infinitely scalable real-world systems is how you connect them together. Typically, in Microservices, you send or “push” messages from one Microservice to another. When you reach the throughput of a Microservice instance, you clone a few more instances. When you reach the CPU or memory utilization limits of the virtual machine, you fire up more VMs. The key is that you “push” messages. This, however, is the wrong approach. We all know what happens when you push something hard enough: it will fall over. Think of the classic scene from the “I Love Lucy” television program where Lucille Ball is wrapping chocolate candies on a conveyor belt. This graphically demonstrates that the “push” model is the wrong approach. In Douglas Adams’s “The Hitchhiker’s Guide to the Galaxy”, the quote is “We’ll be saying a big hello to all intelligent lifeforms everywhere and to everyone else out there, the secret is to bang the rocks together, guys.” To paraphrase Mr. Adams, the secret to scalable processing systems is really to “pull”, not “push” messages between Actors. Rather than send messages directly between Actors, the messages are deposited into queues from which Actors can “pull” messages.
As each Actor becomes available, it pulls the next message out of the queue and processes it. This has a number of advantages over “pushing” messages, such as increased Actor process stability, load balancing, predictive monitoring, and transparent redundancy. Actors are computer programs and as such they aren’t lazy. An Actor will process messages as fast as its execution environment permits. If messages begin to back up in a queue, then you know, long before it becomes critical, that more Actor processes are required. As these new Actor processes become available, there is no need to add them to a load balancer. Each new Actor connects to the same queue and starts asynchronously removing and processing messages. Similarly, when queues become empty, redundant Actors can be terminated. Finally, by using network routing, it’s possible to route messages to redundant queues. If the primary queue fails, Actors can “failover” to a redundant queue and continue processing without message loss. While the Actor model is 42 years old, the queue data structure was originally described by Alan Turing 70 years ago, in a paper published in 1947! While these two “ancient” computing paradigms form the basis for modern, infinitely scaling systems, there are a number of details that must be dealt with, including how to handle work lost when Actors fail; how to maintain state or context; how to handle long-running processes; how to handle “split brain” network failures in light of redundant message queues; synchronization of redundant message queues, etc. This presentation will discuss these issues as well. The goal of the presentation is to outline for software developers the framework they can use to develop highly scalable, highly resilient processing systems.
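
The pull model described above can be sketched in a few lines with Python's standard `queue` and `threading` modules. This is a single-process toy under stated assumptions, not the speaker's system: the `actor` and `run` functions, the squaring "work", and the sentinel used for shutdown are all hypothetical, and a real deployment would use a networked message queue rather than an in-memory one.

```python
import queue
import threading


def actor(inbox, results, stop):
    """Pull messages from the shared queue until the stop sentinel arrives."""
    while True:
        msg = inbox.get()          # blocks until a message is available
        if msg is stop:
            inbox.task_done()
            break
        results.put(msg * msg)     # stand-in for real message processing
        inbox.task_done()


def run(messages, n_actors=3):
    """Process all messages with n_actors pulling from one queue."""
    inbox, results = queue.Queue(), queue.Queue()
    stop = object()                # unique sentinel, one per actor
    workers = [
        threading.Thread(target=actor, args=(inbox, results, stop))
        for _ in range(n_actors)
    ]
    for w in workers:
        w.start()
    for m in messages:
        inbox.put(m)               # producers only deposit; they never push to an actor
    for _ in workers:
        inbox.put(stop)
    inbox.join()                   # wait until every message has been handled
    for w in workers:
        w.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Adding capacity means starting another `actor` on the same queue, with no load balancer to update, and the queue depth (`inbox.qsize()` here) is the early signal that more actors are needed.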

The Hidden Costs of On-Call: False Alarms

Cody Wilbourn (

On-call teams, postmortems, and costs of downtime are well-covered topics of DevOps. What’s not spoken of is the cost of false alarms in your alerting. The team’s ability to effectively handle true issues is hindered by this noise. What are these hidden costs, and how do you eliminate false alarms? While you’re at LISA17, how many monitoring emails do you expect to receive? 50? 100? How many of those need someone’s intervention? Odds are you won’t need to go off into a corner with your laptop to fix something critical for all of those emails. Noisy monitoring system defaults and un-tuned alerts barrage us with information that isn’t necessary. Those false alerts have a cost, even if it’s not directly attributable to payroll. We’ll walk through some of these costs, their dollar impacts on companies, and strategies to reduce the false alarms.

Becoming a Plumber: Building Deployment Pipelines

Daniel Barker (DST Systems)

A core part of our IT transformation program is the implementation of deployment pipelines for every application. Attendees will learn how to build abstract pipelines that will allow multiple types of applications to fit the same basic pipeline structure. This has been a big win for injecting change and updating legacy applications.

DevOps Exchange (London, UK)

Ticketmaster’s Journey to DevOps

Connon MacRae (Ticketmaster)

Ticketmaster’s journey through DevOps and what we’re doing next to prepare our org, and technology, for the future.

DevOps? Oh yeah, we do that too…

Matthew Macdonald-Wallace (Mockingbird Consulting)

Matthew Macdonald-Wallace (@proffalken on Twitter) has been around a while, possibly a bit too long. He remembers when “immutable infrastructure” meant Norton Ghost, and network issues were usually caused by someone kicking a cable in the datacentre. Join Matthew as he rants about why DevOps is about more than tools, the death of the sysadmin, the rise of the programmer, and why he started his own consultancy in a world where the true meaning of DevOps has all but been eroded by marketing and sales teams.

DevOpsDays Warsaw (Warsaw, Poland)

Kubernetes Meets Infrastructure As Code

Kamil Rogoń (Container Labs)

Thinking of your infrastructure as code is crucial for a real DevOps experience. During the session Kamil will summarize how Kubernetes is designed to meet this requirement. But that’s not all. Kamil will share how kops can handle the installation, upgrades, and all day-2 operations of the infrastructure, satisfying the first paradigm of modern container orchestration.

Kamil Rogoń is a DevOps Architect. He brings systems administration experience of both traditional IT infrastructure and production Public Cloud usage. Kamil used to drive industry changes within Intel while working on improving Software Defined Infrastructure stacks. His effort is now oriented towards helping customers build clouds for container orchestration, which fits into changing IT in a modern way. Known for being paranoid about security and privacy, he believes that the automation of everything saves a lot of valuable DevOps time.

Security At Devops Speed

Ryan Sheldrake (Sonatype)

Software development is pressed for faster and faster release cycles with acceptable quality, budget and security. As movements like CI, CD and DevOps aim to cut down on release cycles, it’s security’s job to help control the risk. The risk landscape is complex as modern development practices consume more and more third party code. Traditional methods do not cut it anymore – it’s time for DevSecOps. This session gives an overview of how companies have implemented DevSecOps practices in their own delivery pipelines and how this can help increase developer awareness of risks affecting them. We’ll walk through an example CI/CD pipeline and explore how security has been embedded as a part of it, how the movement is shaping up and how standards are starting to follow suit.

Ryan Sheldrake is an experienced thought leader in the DevSecOps space, leveraging over 15 years of enterprise IT experience in the financial and retail sectors. He is working to change the way the industry thinks about secure software delivery. Build quality in!

What Can Possibly Go Wrong With Your DevOps Transformation?

Tomasz Pająk (Seqr)

A successful DevOps transformation embraces three equally important areas: (1) optimizing the flow of value, (2) optimizing the flow of feedback in the opposite direction to the flow of value, and (3) a culture based on continuous learning and experimentation. I would like to discuss challenges one may face during the transformation. The discussion will be based on my experience at Seqr as an Engineering Manager and at other companies whose changes I witnessed as a consultant/coach. I can name three main sources of those challenges: people/organizational culture, processes, and technology/architecture. The exact way one should cope with them usually differs between organizations, as one deals with complex systems in which solutions cannot just be copied to obtain the same result. Nevertheless, I would like to provide you with real-life examples and generalize them with the ideas behind any DevOps transformation. Hopefully, when equipped with them, you will be able to find a solution suitable for your organization on your own.

Tomasz is a Software Engineering Manager at the fintech company Seqr, where he faces the challenges of developing a disruptive product (mobile payments). He specialises in building high-performance organisations through Agile and DevOps transformations. Tomasz shares his experience as a speaker at multiple international conferences (e.g. Agile Cambridge, DevOpsDays, Agile Management Congress, Agile Lean Europe) and as a contributor to InfoQ.

Safe Container Lashing

Viktorija Almazowa (Cloudworks)

Containers have become the way to go in many companies. This is no surprise, as the benefits of flexibility and short release times are undeniable. Having only what the application needs in a container is great. But what about security? Is it more secure or not? And what should be taken into account when deploying containers in production? During the talk we will go through the main security principles that should not be forgotten, so that a solution is not only fast to deploy but also secure. Technical examples will cover Docker usage and some useful tools that help in hardening containers.

Viktorija Almazowa is a Cloud Security Architect at Cloudworks with more than 10 years of experience in security. She spends all her time working closely with developers and architects to build security in from the design level. She is a big supporter of making security part of the culture and shifting security to the left. Viktorija believes that empowering developers and architects in security tasks through education will raise the security level without adding workload. In her free time she deep dives into Azure security, development, identity and access management.

Kubernetes Or OpenShift – Choosing Container Platform For Dev And Ops

Tomasz Cholewa (Mindbox)

Kubernetes has become the most popular choice among container orchestrators, with a strong community and a growing number of production deployments. There is no shortage of K8s distros: at the moment 20+ and counting. Many distributions simply add toolsets, while other products embed Kubernetes and add more features. In this presentation, you’ll learn about OpenShift and how it compares to vanilla Kubernetes – their major differences, best features and how they can help to build a consistent platform for Dev and Ops cooperation.

Tomasz Cholewa: Architect of modern infrastructure based on cloud, containers and Linux systems. “Infrastructure as Code” practitioner, enthusiast of real DevOps based on culture rather than on tools, and zealous automation fan. He acquires new skills with passion and often with a new certificate added to his collection. He likes to share his knowledge with others and applies it by designing effective environments for modern applications. Fan of absurd humor and an incurable perfectionist who always sees room for improvement.

DICE: Quality-Aware DevOps For Big Data Applications

Giuliano Casale (Imperial College London – as Coordinator of the DICE Consortium)

Eclipse is a powerful solution for software development, a reference in the field. Why not enrich its Model-Driven facilities to architect and analyse Big Data applications with quality metrics? What is more, why not add to the Eclipse foundation instruments for software operation as well, as opposed to software development? This talk/demo tells the story of how we addressed the two questions above within the DICE R&D collaboration.

We successfully integrated the Eclipse IDE with model-driven facilities for architecting, analysing, and operating Big Data applications, enhancing the standard Eclipse IDE outfit with our own analysis tools, specific UML profiles, and an orchestration engine based on Cloudify that can deploy Big Data application blueprints specified in the standard OASIS TOSCA language on multiple cloud infrastructures, such as Amazon, FlexiOPS, and OpenStack. The talk will introduce the main features of the DICE Eclipse-based framework and explain how it can help develop Big Data applications in a DevOps fashion, taking quality constraints into account. The DICE framework comprises 15 tools, including the DICE IDE and a UML profile to model technologies such as Apache MapReduce, Spark, Storm, and several others. The remaining tools add great potential to any Eclipse IDE and cover, respectively, automated deployment, simulation, optimization, verification, monitoring, anomaly detection, trace checking, enhancement, quality testing, configuration optimization, fault injection, repository management and continuous integration. DICE has been applied to industrial case studies, such as the development of a data-intensive IoT application for port operations, which will be demonstrated in this talk.

Giuliano Casale received the Ph.D. degree in Computer Engineering from Politecnico di Milano, Italy, in 2006. In 2010 he joined the Department of Computing at Imperial College London, UK, where he is currently a Senior Lecturer in modeling and simulation. Previously, he worked as a scientist at SAP Research UK and as a consultant in the capacity planning industry. He teaches and does research in performance engineering, cloud computing, and Big Data, topics on which he has published more than 100 refereed papers. He has served on the technical program committee of over 80 conferences and workshops and as general or program co-chair for conferences in the area of performance engineering including SIGMETRICS/Performance, MASCOTS, QEST, Valuetools, ICPE, and ICAC. He is the recipient of several awards, including the best paper award at ACM SIGMETRICS 2017. He is a member of the IFIP WG 7.3 group on Computer Performance Analysis and since 2015 has served on the ACM SIGMETRICS Board of Directors.

Ignite Talk: #Monitoringlove In 2017

Tomasz Tarczyński (Gigaset)

In 2011 the #monitoringsucks discussion started among system engineers. Now, six years later, this area has changed a lot. Today most fast-paced organizations have modernized their approach to monitoring. We probe at high frequency, we have made monitoring systems distributed, and we use automation (configuration management) to control all parts of the monitoring stack. Our tools have changed a lot too. They are now specialized for a given task, allow the use of exchangeable components, are easily interconnected and allow rapid development in a self-service model. In 2017, changes in monitoring are made through an API or config management and not by filing a ticket. I’m mainly interested in OSS monitoring tools, so the ignite will show how exciting it is to build a monitoring platform with modern tools that can be used as a service by the whole organization.

Tomasz Tarczyński: Systems architect and Ops technical lead at Gigaset. Focused on shortening delivery times with adoption of DevOps culture, automation and measurements. Encourages sharing of knowledge and experience both within and outside of the organization. Especially excited by building modern monitoring. Organizer of the DevOps Wroclaw Meetup group.

Ignite Talk: DevOps In A Serverless World

Serhat Can (OpsGenie)

Serhat is a Software Engineer and a new DevOps Evangelist for OpsGenie. He contributed to different parts of OpsGenie as an engineer and now tries to spread the word by coding, writing and talking about DevOps. He is still a proud member of the on-call schedules.

Infrastructure As Code – Lessons Learned

Kamil Szczygieł (Container Labs)

Defining infrastructure as code has grown massively in popularity. Due to demand, a large set of tools has emerged that try to address this approach’s requirements. Because no tool is perfect, there are always caveats that a DevOps engineer will encounter along the way. This talk will walk the audience through real-life use cases, the problems faced and possible solutions while working with infrastructure as code solutions such as Terraform and SaltStack. It will also include a brief overview of tools that attendees may find helpful in their regular day 2 operations using infrastructure as code.

Kamil is a DevOps Architect at Container Labs where he helps to build fault-tolerant and scalable solutions based on Mesos/Kubernetes. Previously he worked on delivering bare metal, automated VMware stacks to enterprise customers and improving the performance/scalability of large Software Defined infrastructures. He is an automation freak (if it can be automated, it should be automated!), new technologies enthusiast and DevOps attitude evangelist.

Is #NoOps The New Evil? Or Is It Just „I Don’t Care“-Ops?

Bastian Widmer (Amazee.Io)

Are you looking for an easy way to move your apps from dev to production without hiccups? Would you like to easily spin up a staging environment similar to the production one only when it is needed? Maybe you want to apply security patches on thousands of your VMs or just do a blue/green deployment of your newest Docker container? Finally, maybe you just want to find out how “Infrastructure as Code” looks in practice and how it can be automated with a CI pipeline? Curious? Come and see how we applied Ansible to almost every part of our job and made our lives better! PS. Do you know how to make tea or coffee with Ansible? We’re looking for solutions…

Bastian works as an Engineer at the Startup, where we strive to make hosting the open source content management system, Drupal, as easy as possible. The nature of his internet-centric position makes him very aware that we need to safeguard the openness and neutrality of the internet on a global scale. This is also why he engages in several groups in Switzerland to work on issues related to net neutrality, security, and surveillance. Outside of his job, Bastian dedicates much of his time to organizing various conferences and gatherings such as TEDxBern, DevOpsDay Zurich and DrupalCon. Bastian is passionate about Architecture, Circular Economy, the future of work, the digital revolution and its impact on society. When he is not doing work or projects that are closely related to computers, he enjoys nature, hiking and piloting his paraglider through the Swiss Alps. Apart from that, he likes to travel around the world to meet like-minded people, learning how their communities differ from others and how we can learn and adapt from each other.

Microservices And Devops In The Cloud

Tomasz Stachlewski (AWS)

Learn about Amazon’s transition to a service-oriented architecture and DevOps culture over a decade ago. During this session we will describe how Amazon moved to a microservices approach to building applications and how it created the concept of two-pizza teams, which build services for the Amazon Web Services (AWS) cloud. We will also cover how other companies are using the cloud to transform themselves and create applications faster.

Tomasz is a Senior Solutions Architect at Amazon Web Services, where he helps companies of all sizes (from startups to enterprises) in their Cloud journey. He provides guidelines for creating cloud solutions that deliver the most value to his customers and helps take their IT to the next level. He is a big believer in innovative technology such as serverless architecture, which allows organizations to accelerate their digital transformation. Before joining Amazon, he worked at LOT Polish Airlines, where he architected their first cloud projects, and at Accenture.

KEYNOTE: Docker, Moby Is Killing Your #Devops Efforts

Kris Buytaert (Inuits)

Kris Buytaert is a long-time Linux and Open Source Consultant. He’s one of the instigators of the devops movement, currently working for Inuits. He frequently speaks at, or organizes, international conferences and has written about the same subjects in different books, papers and articles. He spends most of his time working on bridging the gap between developers and operations, with a strong focus on High Availability, Scalability, Virtualisation and Large Infrastructure Management projects, trying to build infrastructures that can survive the 10th floor test, better known today as the cloud, while actively promoting the devops idea! His blog titled “Everything is a Freaking DNS Problem” can be found at

DevOps For Application Security

Krzysztof Kopera (Intelligent Services)

DevOps4AppSecurity – Rather than improving security only after the development process, we adopt DevSecOps, which supports the planning and control of security from the beginning of development and operation. As a result, we expect security oversight to be the work of everyone involved in the development process, with security designed and tested on a regular basis. In the presentation we will focus on DevSecOps transformation challenges, and we will also analyze real cases where the DevSecOps approach could be an effective remedy for security threats. Topics include: 1. How application deployment changed with security in mind: DevOps, security checking automation. 2. Application security and techniques for getting more from application testing. 3. How enterprises are leveraging DevOps to deliver more secure applications faster.

About Krzysztof: Agile and DevOps Consultant, solution architect and project leader in the development and integration of information systems for the telecommunications industry, insurance and medical services. Expert in business and technical analysis, DevOps, Process Automation, and Application Security. Experienced in development and delivery of business and network support systems for telecommunication, medical imaging solutions for healthcare, and marketing automation for mobile network operators. Privately, a husband and father of an 11-year-old boy and an 8-year-old girl. Road and x-cross cyclist as a hobby.

Ignite Talk: DevOps starts in the kindergarten

Rafał Malanij

Microservices Meetup Munich (Munich, Germany)

All Roads Lead to DevOps

Erik Dörnenburg

Today it is hard to imagine that fifteen years ago agile development was a niche approach, considered too radical to be used in the mainstream. Similarly, when the DevOps movement started about five years ago, only a small number of innovative organisations took note. They quickly gained competitive advantages, which then led to more and more interest in the movement. Erik will talk about the experiences organisations have had with DevOps. He will discuss processes, tools, and organisational structures that led to the successful merging of development and operations capabilities, and he will describe how DevOps fits with other trends such as Microservices and Public Clouds. All of this forms a picture that allows only one conclusion: sooner or later all successful organisations will move to a DevOps model.

Erik Dörnenburg is a software developer, consultant, and Head of Technology at ThoughtWorks, where he helps clients with writing custom software. Over the years Erik has worked with many different technologies and technology platforms, always curious to understand the potential they offer to solve real-world problems. Having seen a fair share of overly complex architectures he became interested in exploring simplicity in design and software visualisation as a way to make architecture more understandable. Erik’s career began in the early nineties and throughout he has been an advocate of agile values and open source software. Over the past ten years he has spoken at many international conferences, contributed to a few books, and maintained several open source projects. He holds a degree in Informatics from the University of Dortmund and has studied Computer Science and Linguistics at University College Dublin.

