Taking System Monitoring to the Subsequent Stage: an Interview wi…


As computing ecosystems become more complex, monitoring and
analyzing those often disconnected moving parts becomes increasingly
challenging.

Today’s data center has evolved from a
single supplier producing and selling all-in-one offerings, such as
the days when EMC, NetApp, HP or even Sun owned your data center and
you chose a vendor and stuck with it. Those same vendors provided you
with the required tools to monitor, analyze and troubleshoot their entire
stack.

Shifting focus to the present, the landscape
now appears to be quite different. Instead, you will find environments
of mixed offerings provided by an assortment of vendors, both large and
small. Proprietary machines work side by side with off-the-shelf commodity
devices hosting software-defined software. Half of your applications
may be hosted in virtual machines over a hypervisor or just spun up
in a container. How does a modern data-center administrator or DevOps
professional manage such an environment?

An assortment of platforms and frameworks exist that provide
such capabilities, but they’re not all one and the same. In
some cases, those same tools will need to be coupled with others to
produce something useful (for example, ELK: Elasticsearch + Logstash +
Kibana). Unfortunately, this arrangement just adds to the complication and frustration
when attempting to diagnose or discover problems in your computing
environment.

Putting an end to this level of complexity, one company
stands out among the rest: Scalyr. Scalyr develops and offers a
complete suite of server monitoring, log management, visualization and
analysis tools, which integrate with cloud services. I recently had
the pleasure of chatting with Scalyr CEO Steve Newman.

His is not a household name, like Steve Jobs or Bill Gates, but you
will be familiar with his work and contributions to cloud-enabled
technologies. Although this is likely to change with Scalyr, Steve is best
known for his work with Writely, a technology that later was acquired by
Google and relabeled as Google Docs. In our conversation, Steve and I took
the opportunity to discuss Scalyr, its solution and the problem it solves.

Steve Newman, Scalyr CEO

Steve Newman, Scalyr CEO

Petros Koutoupis: Tell me a bit about yourself. Who is Steve Newman?

Steve Newman: I am an engineer by both training and background
and have spent most of my career in the startup environment. This is
because I enjoy building things. I was at Google for a number of years
following an acquisition, and while the experience itself was great,
the startup bug in me drove me to Scalyr.

PK: So, now you founded a company called Scalyr. Please
tell us, what is Scalyr?

SN: Scalyr is a log management platform for engineers
responsible for software development. We collect logs from applications,
services, containers and systems, and make that data available to help
engineering teams track down problems and generally manage the complexity
of modern development and operations.

PK: And why? What problem(s) does your product solve?

SN: Traditional log management tools are complex
and often very slow at scale. This leads to a “gatekeeper” approach
to log management, where only a select few acquire the expertise (and
have the patience!) to access this critical data. Logs become a tool of
last resort, hindering the team’s ability to rapidly or proactively
address issues.

My co-founder Steven Czerwinski and I first experienced this problem
ourselves at Google. We were leading an infrastructure project supporting
Google Docs, Drive, Photos and other related applications. There were a
lot of moving parts, and the engineering team spent close to half of its
time simply tracking down problems. We started Scalyr in 2011 to create
the tool we wished we’d had at Google—one that would allow us to
make sense of the flood of telemetry data and quickly understand why a
complex system is misbehaving.

PK: How does Scalyr work?

SN: We’re a fully managed SaaS solution. Logs are
sent to our centrally hosted search cluster, using our agent or an array
of API integrations. Engineers then use our web app (or APIs) to analyze,
visualize and explore the logs.

The critical component is the back end. We’ve built the back-end software
stack from scratch, optimizing for the data access patterns that arise
in log management. Some interesting aspects of our approach are:

1) Unlike other log management solutions, we don’t use indexes. Keyword
indexes are optimized for finding the “best ten matches”, in a corpus
comprised of slowly evolving, human language text such as web pages or
product descriptions. Log management use cases are very different, with
small units of text (individual log messages), constantly updated, full
of record IDs and other non-words that balloon the vocabulary size. Most
important, log management queries generally visualize a complete data
set, rather than stop after ten high-ranking results. Keyword indexes
don’t help much there, and they are complex, expensive to maintain
and often impose multi-minute ingestion delays.

We’ve taken a much simpler approach, building a streamlined, columnar
data store that’s optimized for log data. The basic idea is that we
just store logs and scan them, like good old-fashioned grep. We then use
a lot of tricks to minimize the amount of data that needs to be scanned;
for instance, when querying specific fields of a log, the columnar data
layout means that we need to scan only those fields.

2) We process queries one at a time, globally. This allows each customer
to use our entire search cluster, with aggregate search performance of
1.5 terabytes per second. It’s fast enough (96% of queries complete
in less than one second) that queries almost never wait in line—we
finish each query before the next one arrives. The nice thing about
this approach is that there’s an economy of scale: as our customer
base grows, performance increases.

3) We’ve built a separate back end for repetitive queries, such as
those used in dashboards or alerting rules. This part of the system
is based on a time series database, with a custom time series for each
query. The ingestion engine automatically updates these time series as
log messages arrive. This means we don’t need to execute queries to
display a dashboard or evaluate an alerting rule—the relevant data
has been precomputed in the time series. In database terms, we’re
automatically creating materialized views where needed.

PK: What makes Scalyr stand out or competitive with
existing solutions?

SN: Speed, simplicity and scalability.

Speed was central to our mission from day one. Logs are a massively
useful, detailed data source, but when it takes minutes (or longer)
to run a query, engineers avoid using them. We satisfy most queries in
a fraction of a second. We also ingest data in real time: new logs are
available for querying within a few seconds.

Simplicity goes hand in hand with speed, which is best measured as
the time from a question in an engineer’s head to an answer on
his or her screen. The fastest back end in the world is of little use if
you spent five minutes wrestling with the query language. We rely
on our performance, as well as our ability to parse logs and extract
structured data on ingestion, to provide a set of visual exploration
tools that allows engineers to get answers without becoming query language
experts. There’s a query language, but you can dive in without knowing
anything about it.

Finally, customers often choose us for our scalability. We continue to
work well not only as server count and data volume increase, but as a
team grows. Scalyr works just as quickly whether it’s three or 1,000
people looking at logs. This helps teams move away from the gatekeeper
model of traditional log management to the concurrent, collaborative
engineering model that modern organizations are increasingly adopting.

With Scalyr, companies for the first time have a log management platform
that does not charge ruthless licensing fees for new users, involve long
ramp-up times or require learning arcane query languages. It’s meant
for teams who need to move fast.

PK: Who would benefit from using Scalyr?

SN: Our sweet spot is organizations where the
application is critical to the business. From B2B software to online
retailers, dating platforms and media companies, every business’
competitive advantage increasingly is the technology stack and the speed
at which that stack can evolve. Scalyr is critical to enabling that. Some
of our customers include NBCUniversal, OkCupid, Zalando and ReturnPath.

Within those organizations, the primary Scalyr user usually comes from
engineering or DevOps. But Scalyr is simple enough to use that we often
see usage spread to other roles like support, which can search logs to
track down specific client problems. Some of our customers have upwards of
1000 individuals with Scalyr logins. Typically, half the users are active
on a weekly basis, which is huge engagement compared to traditional log
management platforms.

PK: How easily does Scalyr integrate into current
production environments?

SN: It’s our mission to meet customers where they are,
so we support many different models. Some run their own servers or virtual
servers. Some are on Kubernetes, while others are serverless. Regardless,
setup first involves retrieving logs. The most common way of doing this
is with a lightweight agent that customers can install as a container,
sidecar or whatever they need. We also have API integrations to retrieve
logs directly from the wide array of cloud services in use today.

Once the logs are flowing in, you’re off and running, but an
important further step is to set up parsing rules. This allows us to
extract structured data, unlocking the full power of the analysis and
visualization tools. To make this as easy as possible, we’ve built
three generations of parsing engines. The current engine is so easy to
use, we’ve literally put a button in the product that tells our
support team to set up the rules for you. Of course, being engineers,
many of our customers prefer to do it themselves.

Conclusion

With today’s internet, the “app” increasingly is
the business (think of Uber, Airbnb, Amazon and so on), and getting to
the bottom of downtime is crucial. This process is made more difficult
than ever as the system or code is increasingly distributed with the use
of containers, serverless and other technologies. That is where Scalyr
comes in with its log analysis platform. It is crazy fast and easy to
use. To learn more about this wonderful product, visit https://www.scalyr.com.



Source link

We will be happy to hear your thoughts

Leave a reply