OP2

Executive Summary

FireFly, our GraphQL service, functions as Amazon Music’s primary general-purpose API aggregator, serving 1P, 2P, and 3P clients. It provides a consistent, unified data model across internal Music services, achieves fast round-trip resolution through caching, and optimizes queries by fetching only the data requested, which reduces unnecessary data transfer. The 2025 roadmap focuses on enhancing developer experience, improving performance, and expanding service capabilities to meet evolving business needs. Our investments are allocated across building core capabilities and primitives (36.5%), improving service performance, stability, and reliability (37.8%), developing operational excellence (14.5%), and meeting non-negotiable compliance and legal requirements (11.2%).

2024 Recap

In 2024, FireFly contributed significantly to key projects across Amazon Music. We built core capabilities such as new entity support for Audiobooks, Fan Groups, Insights, Subscription, and Merch; supported key functionality such as live session chat and anonymous authentication; and improved overall stability and performance through cache optimizations, better error handling, pipeline improvements, and intelligent service timeout configurations. FireFly currently integrates with 60+ data providers, supports 450+ distinct fields for queries and mutations, and serves 35+ internal clients including DragonFly, Skyfire, and native module apps. With roughly 110+ CDK commits to infrastructure and 400+ commits to core, FireFly completed 28+ schema updates and addressed 270+ external asks with a success rate of over 80% in 2024. We also improved FireFly’s operational efficiency and flexibility by moving to ECS, improving ElastiCache cluster resiliency, reducing rollback time to 20 minutes or less, and speeding up deployments with a higher-compute GitLab runner fleet.

Developer Experience Q4 Survey feedback

In January 2025, CXI’s Developer Experience team conducted a comprehensive survey to assess our development ecosystem and gather feedback on specific Infrastructure products, including FireFly. The survey revealed a CSAT score of 4.06 (on a 7-point scale) for FireFly, with 20% of respondents (n=50) expressing slight or strong satisfaction. However, 60% of users reported requiring moderate to extensive support while using the service. Key challenges included insufficient documentation on concepts, tenets, guidelines, best practices, and engagement models (52%), prolonged away team code review timelines (52%), and difficulty launching short-term experimental CX without undergoing the full review, schema design, and implementation process (50%). Developers suggested improvements such as better observability with seamless end-to-end trace profiling (60%), unified documentation paired with a developer console and schema exploration tools (50%), and simplified schema experimentation with lower barriers (44%). These insights have significantly informed our 2025 roadmap, guiding our focus on addressing pain points and implementing the suggested improvements. For a complete list of identified issues and proposed areas of improvement, please refer to Appendix B.

Vision & Strategy

Our three-year vision is to establish FireFly as the cornerstone API aggregator in Amazon’s audio streaming ecosystem. In 2025, we will execute our strategy across the four key themes outlined in the executive summary. We will focus on significantly reducing query response times and raising uptime to 99.99% through infrastructure migration to ECS and advanced caching mechanisms. Our API ecosystem will expand to include real-time streaming capabilities and integration with numerous new downstream services. We will drive developer adoption by enhancing documentation and tools and by simplifying schema experimentation. Operational excellence will be reinforced through proactive infrastructure upgrades and advanced security implementations. This approach addresses immediate performance and reliability concerns while laying the foundation for our long-term goal of positioning our GraphQL service as the universal data access layer and industry benchmark for large-scale audio streaming platforms.

Tenets (unless you know better ones)

The following tenets serve as guiding principles for FireFly, shaping our decision-making and ensuring alignment with our overall vision and strategy as we evolve and improve our API platform:

  • Developer-First: Craft an intuitive, self-documenting GraphQL schema that accelerates product innovation for all customers.

  • Multi-Use Data Priority: Consolidate multi-feature/use-case data while judiciously incorporating essential single-consumer fields.

  • Schema Stability: Embrace proven, long-term attributes to maintain a clean and reliable schema.

  • Ownership and Governance: The FireFly team stewards the schema, ensuring backward compatibility, proper versioning, and future-ready composability.

  • Data Integrity: Prioritize direct data access, replicating only when performance or reliability gains are substantial.

  • Enterprise-Grade: Uphold Tier-1 reliability and security standards with strict SLAs, comprehensive monitoring, and robust access controls.

  • Merit-Driven Adoption: Offer a compelling, paved-path solution while respecting customer choice in technology decisions.

2025 Investment Pillars and Initiatives

1. Core Capabilities (36.5% resource allocation)

As of 2024, FireFly has developed a data graph that covers essential Music, Podcast, and Audiobook entities and their relationships. Our APIs provide access to Music and Podcast catalog data, along with primitives reflecting users’ activities and taste signals such as like/follow status and listening histories. We offer Playback, Search, and Recommendation APIs for accessing these primitives. As part of project Montana, we added Audiobook as an entity and currently support a number of use cases around audiobook discovery and recommendations.

In 2025, we aim to expand our core capabilities, enhancing the service’s versatility and power. This theme focuses on developing new primitives and extending existing ones to meet evolving business needs and technological advancements. A few of the key initiatives under this theme include:

1.1. Onboarding new ‘entities’ and expanding functionality for existing entities

FireFly will add ‘concerts’ as a new entity, enabling experience teams to build fandom-forward experiences such as following concerts and live events, concert recommendations, and ticket purchases. To support ‘Blackbolt’, an innovative AI-driven audio app, FireFly will add ‘collections’ as a new entity to power its CX. We will expand the functionality supported for ‘fangroups’, a strong pillar of Amazon Music’s fandom strategy (2025 S-Team goal), by enabling follow, block profile, UGC ban status, attachments in fan group messages, and thumbnail/banner images. Further, to power year-round insights CX for customers, we will augment the ‘InsightsHub’ API as part of FireFly. Most of these capabilities will be developed using the FireFly away team contribution model (FAQ 5), with the core FireFly team providing schema reviews, feedback, and MR reviews to maintain the scalable graph architecture.

1.2. Subscriptions and real-time updates

We will implement GraphQL subscriptions to enable real-time data updates for our clients. This feature allows clients to subscribe to specific events or data changes and receive updates as they occur, without polling. Subscriptions will bring real-time data synchronization, reduced network overhead, and a better user experience for live updates, unlocking use cases around Casting, collaborative playlists, live event updates, and several Fandom initiatives.
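As a rough illustration of the push model described above, the sketch below shows a minimal in-memory publish/subscribe hub. It is not FireFly’s implementation and not a GraphQL server API: `SubscriptionHub`, the topic name, and the `PlaylistEvent` shape are hypothetical, and a real GraphQL subscription would also need a transport such as WebSockets.

```typescript
// Minimal publish/subscribe sketch. All names here are hypothetical.
type Handler<T> = (payload: T) => void;

class SubscriptionHub<T> {
  private handlers = new Map<string, Set<Handler<T>>>();

  // Register interest in a topic; returns a function that cancels the subscription.
  subscribe(topic: string, handler: Handler<T>): () => void {
    let set = this.handlers.get(topic);
    if (!set) {
      set = new Set<Handler<T>>();
      this.handlers.set(topic, set);
    }
    set.add(handler);
    const owned = set;
    return () => {
      owned.delete(handler);
    };
  }

  // Push a payload to every current subscriber; returns how many were notified.
  publish(topic: string, payload: T): number {
    const set = this.handlers.get(topic);
    if (!set) return 0;
    set.forEach((h) => h(payload));
    return set.size;
  }
}

// Usage: a client is told when a collaborative playlist changes instead of polling for it.
interface PlaylistEvent {
  playlistId: string;
  addedTrackId: string;
}

const hub = new SubscriptionHub<PlaylistEvent>();
const received: PlaylistEvent[] = [];
const unsubscribe = hub.subscribe("playlist:p1", (e) => {
  received.push(e);
});

hub.publish("playlist:p1", { playlistId: "p1", addedTrackId: "t42" });
unsubscribe();
```

After `unsubscribe()` is called, later events on the same topic are no longer delivered to that client, which is the behavior a subscription-based client would rely on when leaving a screen.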

1.3. Customer benefits management

As Amazon Music has evolved, the tier-based logic for vending customer benefits has become increasingly complex, with multiple use cases where benefits depend on customer context such as profile (project Montana), territory (project Geet), authentication status (project Casper), and device (project Canary). This complexity has spread into FireFly, which has to maintain multiple pieces of hard-coded logic to vend benefit information to clients. Stratus, owned by the Maple Identity org, is evolving the benefits platform (proposal) to support these ever-evolving use cases. We are collaborating closely with them to model it in FireFly in a way that not only unlocks multiple use cases for our customers but also simplifies our tech stack, resulting in better performance and lower cost. We are also working closely with the FMPM team to consolidate the ‘dynamic entitlement’ benefits currently handled by their MDEX (MusicDynamicExperience) service.

1.4. Expanding Audiobook support

In 2025, in pursuit of Amazon Music’s “All Audio” goal, we plan to expand the currently supported Audiobook APIs to include recommendations based on category, listening history, trends, new releases, and exclusives. We are also exploring integrating the ‘Playback’ APIs for audiobooks, including retrieving licenses, manifests, and playback assets. However, this is currently blocked on the Audible team sharing the relevant API details needed for FireFly to onboard (refer to FAQ 11). These additional capabilities would power multiple use cases across our 1P, 2P, and 3P clients.

1.5. LLM based FireFly Assistant

Lack of unified documentation, onboarding support, and schema exploration tools was reported as one of the biggest areas of improvement for FireFly in the Q4’24 DevEx survey. Developers reported spending a long time onboarding onto FireFly and struggling with poorly performing production GraphQL queries. To address this, in 2024 we developed FireFly Assistant, an LLM-based assistant (runner-up idea in the Q4’24 MCX hackathon) that serves two purposes: an interactive tutor for developers learning GraphQL, and an assistant that helps experience developers optimize their queries for better performance. The tool analyzes queries in real time, providing suggestions and explanations tailored to the problem space. In 2025, we will expand this tool by integrating it into developers’ IDEs to provide contextual suggestions.

2. Performance, Stability and Reliability (37.5% resource allocation)

Our GraphQL API’s performance has been a significant concern for our customers. While much of the latency is attributable to downstream services and sequential request chaining, we are committed to addressing this issue from the platform perspective. We will run a programmatic effort with these downstream systems to optimize for latency, along with making improvements to FireFly infrastructure. Our focus in 2025 is on five key initiatives:

2.1. Observability Improvements

We will develop a query-specific latency dashboard that includes complete server-side latency measurements in addition to query duration, providing better insight into each query’s performance. We will also add metrics and alarms for each integration, enabling a better understanding of performance bottlenecks and faster issue resolution.

2.2. GraphQL Core Infrastructure Migration

We will complete the migration of our GraphQL core infrastructure to Amazon ECS (Elastic Container Service) as part of project MetalFly - Phase 2 by Q2’25. This migration is expected to improve performance and avoid the issues seen with AWS Lambda related to cold starts and connections to the Redis cluster.

2.3. Request Handling and Resource Management Enhancements

We will implement batching support, introduce service- and API-specific timeouts, and optimize data loaders for improved performance. We will also integrate with BMC (Beyond Music Catalog) to fetch podcast- and audiobook-specific content. These optimizations aim to reduce overall response times and manage resource allocation more effectively.
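To make the batching and timeout ideas concrete, here is a minimal sketch under stated assumptions. `BatchLoader` coalesces lookups made in the same tick into one downstream call (the general data loader pattern), and `timeoutFor` resolves the most specific timeout available. The class, the config shape, the service and API names, and the millisecond values are all hypothetical and are not taken from FireFly’s code.

```typescript
// Hypothetical sketch of request batching and timeout lookup; not FireFly's actual code.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

interface Pending<K, V> {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
}

class BatchLoader<K, V> {
  private pending: Pending<K, V>[] = [];
  private scheduled = false;

  constructor(private readonly batchFn: BatchFn<K, V>) {}

  // Queue a key; all keys queued in the same tick share one downstream call.
  load(key: K): Promise<V> {
    return new Promise<V>((resolve, reject) => {
      this.pending.push({ key, resolve, reject });
      if (!this.scheduled) {
        this.scheduled = true;
        // Defer dispatch to a microtask so sibling resolvers can enqueue first.
        Promise.resolve().then(() => this.dispatch());
      }
    });
  }

  private dispatch(): void {
    const batch = this.pending;
    this.pending = [];
    this.scheduled = false;
    // batchFn is assumed to return values in the same order as the keys it was given.
    this.batchFn(batch.map((p) => p.key)).then(
      (values) => batch.forEach((p, i) => p.resolve(values[i])),
      (err) => batch.forEach((p) => p.reject(err))
    );
  }
}

// Timeout lookup: an API-specific value beats a service default, which beats the global default.
interface TimeoutConfig {
  defaultMs: number;
  services: Record<string, { defaultMs: number; apis?: Record<string, number> }>;
}

function timeoutFor(cfg: TimeoutConfig, service: string, api: string): number {
  const svc = cfg.services[service];
  if (!svc) return cfg.defaultMs;
  return svc.apis?.[api] ?? svc.defaultMs;
}

// Illustrative values only.
const timeouts: TimeoutConfig = {
  defaultMs: 1000,
  services: { catalog: { defaultMs: 300, apis: { getTracks: 150 } } },
};
```

With this shape, two resolvers that each call `loader.load(id)` while resolving the same query would trigger a single downstream request, and the caller could bound that request with `timeoutFor(timeouts, "catalog", "getTracks")`.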

2.4. Caching Improvements

We will apply aggressive caching strategies along with “edge caching” capabilities to speed up data retrieval and significantly reduce response times for frequently requested data. We will also explore pre-caching the Catalog and rely on the Catalog event stream for cache invalidation, allowing for real-time updates to cached data. This will ensure that clients always receive the most up-to-date information without sacrificing performance, striking a balance between data freshness and query speed.
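The sketch below illustrates the event-driven invalidation idea in its simplest form: entries expire on a TTL as a safety net, and a change event evicts the affected entry immediately. `EntityCache`, the `CatalogEvent` shape, the key format, and the TTL are assumptions made for illustration; the actual Catalog event stream format is not described in this document.

```typescript
// Hypothetical event shape; the real Catalog event stream format is an assumption here.
interface CatalogEvent {
  entityId: string;
  kind: "updated" | "deleted";
}

class EntityCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    // Injectable clock so expiry can be exercised without waiting.
    private readonly now: () => number = () => Date.now()
  ) {}

  set(id: string, value: V): void {
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined on a miss or when the entry has outlived its TTL.
  get(id: string): V | undefined {
    const entry = this.entries.get(id);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id);
      return undefined;
    }
    return entry.value;
  }

  // Evict on any change so the next read refetches from the source of truth.
  onCatalogEvent(event: CatalogEvent): void {
    this.entries.delete(event.entityId);
  }
}

// Usage with a fake clock: the cached title is dropped as soon as the update event arrives.
let clock = 0;
const cache = new EntityCache<string>(60000, () => clock);
cache.set("track:1", "Old Title");
cache.onCatalogEvent({ entityId: "track:1", kind: "updated" });
```

Evicting rather than updating in place keeps the event handler independent of the payload contents; the trade-off is one extra downstream read after each change.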

2.5. Improved service predictability & reliability

We will work with various upstream teams to make pagination consistent across all queries, introduce per-query limit support, and audit null responses. These changes will improve data handling efficiency and give clients more predictable control over the data they query.
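One way consistent pagination with a per-query limit could look is sketched below. It is loosely modeled on the cursor-based connection convention common in GraphQL APIs, and it is only an illustration: the `idx:` cursor format, the default and maximum page sizes, and the function names are hypothetical.

```typescript
interface PageArgs {
  first?: number;
  after?: string;
}

interface Page<T> {
  edges: { node: T; cursor: string }[];
  pageInfo: { endCursor: string | null; hasNextPage: boolean };
}

// Hypothetical limits for illustration.
const DEFAULT_PAGE_SIZE = 20;
const MAX_PAGE_SIZE = 100;

// Index-based cursor for simplicity; a real service would likely use a sturdier encoding.
const toCursor = (index: number): string => `idx:${index}`;

function fromCursor(cursor: string): number {
  const match = /^idx:(\d+)$/.exec(cursor);
  if (!match) throw new Error(`Invalid cursor: ${cursor}`);
  return Number(match[1]);
}

function paginate<T>(items: T[], args: PageArgs = {}): Page<T> {
  // Clamp the requested size so a single query cannot ask for an unbounded page.
  const requested = args.first ?? DEFAULT_PAGE_SIZE;
  const limit = Math.min(Math.max(requested, 0), MAX_PAGE_SIZE);
  // Resume just after the item the cursor points at.
  const start = args.after === undefined ? 0 : fromCursor(args.after) + 1;
  const slice = items.slice(start, start + limit);
  const edges = slice.map((node, i) => ({ node, cursor: toCursor(start + i) }));
  return {
    edges,
    pageInfo: {
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
      hasNextPage: start + slice.length < items.length,
    },
  };
}
```

A client would pass the previous page’s `endCursor` as `after` to fetch the next page, and a request for more than `MAX_PAGE_SIZE` items is silently reduced to the cap rather than rejected; rejecting with an error is an equally valid design choice.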

3. Operational Excellence and Infrastructure Resilience (14.5% resource allocation)

In 2025, we’re prioritizing operational excellence to maintain our GraphQL service as a robust and cutting-edge solution. This theme focuses on enhancing infrastructure, improving reliability, and staying ahead of technological advancements. These investments aim to support our growing client base and maintain our competitive edge. Importantly, these efforts will reduce our long-term bandwidth allocation for Oncall and KTLO support, allowing us to focus more resources on innovation and new feature development. A few of the key initiatives under this theme include:

3.1. Tier-1 Service Reliability

We’re committed to strengthening FireFly’s Tier-1 status, targeting 99.99% uptime, up from our current 99.45% availability. This involves improving our monitoring and alerting, standing up fallback mechanisms, and reducing our time to recovery.
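To put the two availability figures in perspective, the arithmetic below converts each percentage into an annual downtime budget. It assumes a 365-day year and time-based availability; the function name is hypothetical. Under those assumptions, 99.45% allows roughly 48 hours of downtime per year, while 99.99% allows under an hour.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Maximum downtime, in minutes, that still meets an availability percentage over a period.
function downtimeBudgetMinutes(availabilityPct: number, periodDays: number = 365): number {
  return (1 - availabilityPct / 100) * periodDays * MINUTES_PER_DAY;
}

const currentBudget = downtimeBudgetMinutes(99.45); // about 2,890.8 minutes (~48.2 hours) per year
const targetBudget = downtimeBudgetMinutes(99.99); // about 52.6 minutes per year
```

The target therefore implies cutting allowable downtime by a factor of about 55, which is why faster detection and recovery matter as much as preventing incidents.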

3.2. Agentic Oncall support

We will implement an AI-powered assistant to enhance operational efficiency. This tool will streamline incident response by automatically collecting relevant logs, analyzing issues, and providing actionable insights to Oncall engineers, reducing mean time to resolution and improving overall service reliability.

4. Non-Negotiables (11.5% resource allocation)

4.1. Region Flex support

We will migrate FireFly and MESK infrastructure from DUB to ZAZ as part of the Music-wide Region Flex program. We are working closely with the central Region Flex team and exploring the use of Amazon’s IronHide tool, which supports custom workflows to transform code using deterministic algorithms and LLMs.

4.2. Security & Compliance

We will continue to address security and compliance risks through initiatives such as upgrading the generalized platform-level metrics package for Skyfire to be DMA (EU Digital Markets Act) compliant, updating the PBKDF2 algorithm used for encrypting and decrypting profileId, and adding RED data compliance for sensitive customer fields in FireFly APIs.

FAQs

1. What are the key Metrics & KPIs that we track for measuring FireFly’s success?

Our primary indicator of success for FireFly is the Customer Satisfaction Score (CSAT), measured quarterly on a scale of 1-7. This metric encapsulates the overall health and usefulness of our GraphQL service for our developers. The Q4’24 CSAT score stood at 4.06/7, against a target of 5.5 or higher. The CSAT score reflects developer sentiment on API performance, documentation quality, ease of integration, and overall experience. Along with this, we also track the key supporting metrics below:

  • Performance Metrics

    • Request Rate: Queries and mutations per minute

    • Latency: Median and Trimmed Mean (TM95) response times

    • Error Rates: HTTP and GraphQL errors (4xx,5xx)

  • Operational Metrics

    • Cache Hit Rate: Effectiveness of caching strategy

    • Resolver Performance: Identifying bottlenecks

    • Subscription Notification Rate: Frequency of real-time updates

  • User Experience Metrics

    • Operation Complexity: Depth and breadth of queries

    • Field Usage: Most and least used fields

    • Client Versions: Adoption rates across different clients

  • Business Impact Metrics

    • API Uptime: Service availability

    • Time to Market: Speed of new feature deployments

    • Developer Productivity: Reduction in API development time

We continuously monitor these metrics using our observability tools, allowing us to proactively address issues, optimize our service, and demonstrate FireFly’s value to stakeholders.
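For readers less familiar with the latency statistics listed above, the sketch below shows how a median, a trimmed mean, and a cache hit rate can be computed from raw samples. It assumes TM95 means the mean after discarding the slowest 5% of samples, and it uses a simplified trim that drops whole samples without interpolating at the boundary. The function names and sample data are hypothetical.

```typescript
// Median of a non-empty sample.
function median(values: number[]): number {
  const s = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Trimmed mean of a non-empty sample: average after discarding the highest (100 - upperPct)%.
function trimmedMean(values: number[], upperPct: number = 95): number {
  const s = values.slice().sort((a, b) => a - b);
  const keep = Math.max(1, Math.floor((s.length * upperPct) / 100));
  const kept = s.slice(0, keep);
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}

// Fraction of lookups served from cache; defined as 0 when there was no traffic.
function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Illustrative data: 19 fast responses (1..19 ms) and one 1000 ms outlier.
const latenciesMs: number[] = [];
for (let ms = 1; ms <= 19; ms++) latenciesMs.push(ms);
latenciesMs.push(1000);

const tm95 = trimmedMean(latenciesMs); // the single outlier is discarded
const p50 = median(latenciesMs);
```

On this sample the plain mean is 59.5 ms, dominated by one slow request, while the trimmed mean stays at 10 ms. That robustness to rare outliers is the usual reason to track a trimmed mean alongside the median.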

2. What are some of the key quarter-wise product deliveries from FireFly in 2025?

Below are the key quarter-wise deliveries from FireFly:

Deliveries by quarter, covering core capabilities and enabling stakeholders’ deliveries

Q1’25

  • Maestro prompt history support

  • Short Looping Video visualizer support in FireFly

  • Media central integration

  • Collections entity support for project BlackBolt

  • Identity support for Merlok devices (Fire Kids tablet)

  • Key functionalities support for ‘FanGroups’ entity (thumbnail, banner images, block profile, UGC ban status, etc.)

  • [S&P] Improved Observability: e2e latency metrics breakdown

  • ‘Customer insights’ modeling to support Fandom CX

  • [Operational Excellence] GRS (GothamRatingsService) deprecation support

  • GRS deprecation

  • Evergreen Polls

  • Artist ↔ Merch relationship support

Q2’25

  • [Non Negotiable] RegionFlex Support - FireFly Migration

  • ‘Follow’ functionality support for Concerts entity

  • [Non Negotiable] RegionFlex Support - MESK Migration (Infra)

  • ‘keymasters’ support in FireFly

  • Account creation (project Casper, Quattro)

  • Related playlists & Stations resolvers

  • Benefits management (Project Geet, GH)

  • ‘Nimbly’ deprecation support

  • [S&P] MetalFly - Phase 2

  • Anonymous access to Playlist (OGRE)

  • [S&P] Consistent error handling and Grafana dashboard improvements

  • Top content queries

  • [S&P] Reducing query complexities for Phoenix / 3P APIs

  • Stations Charts track ranking movement

Q3’25

  • GraphQL Subscriptions

  • Custom artwork for playlists

  • RED data compliance for cancellation flow

  • Audiobook playback capability

  • FireFly platform latency optimizations including edge caching

  • Internal Infra caching improvements

  • [S&P] Adding metrics and corresponding alarms to each integration

  • Infra upgrades (Node.js, JavaScript, Apollo GraphQL)

Q4’25

  • Consistent pagination support

  • ‘Per query rate limit’ support for various Music entities

  • Red-Anvil recertification

  • FireFly Tier-1 service goal (99.99% availability)

Throughout 2025, FireFly will focus on delivering core capabilities, performance improvements, and key stakeholder initiatives. These deliveries will significantly strengthen FireFly’s functionality, performance, and integration capabilities, providing substantial value to both internal teams and external partners.

3. How will you ensure backward compatibility with these changes/ launches?

FireFly maintains backward compatibility by making only additive changes to the GraphQL schema. New fields are made nullable with default values, while existing fields are never removed, only deprecated. This approach, combined with careful schema design and validation, allows the service to evolve without breaking older clients.
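As an illustration of how such a policy could be validated automatically, the sketch below compares two versions of a type’s fields and reports violations of the rules stated above. It is a simplified check, not FireFly’s validation tooling: the `FieldDef` shape, the function name, and the example `Track` fields are hypothetical, and a production check would operate on the parsed schema rather than hand-written maps.

```typescript
// Simplified, hypothetical representation of the fields on one GraphQL type.
interface FieldDef {
  type: string;
  nullable: boolean;
  deprecated?: boolean;
}
type TypeFields = Record<string, FieldDef>;

// Returns human-readable violations of the additive-only policy; an empty list means the change passes.
function findPolicyViolations(oldFields: TypeFields, newFields: TypeFields): string[] {
  const violations: string[] = [];
  for (const name of Object.keys(oldFields)) {
    const next = newFields[name];
    if (next === undefined) {
      violations.push(`Field "${name}" was removed; deprecate it instead.`);
    } else if (next.type !== oldFields[name].type) {
      violations.push(`Field "${name}" changed type from ${oldFields[name].type} to ${next.type}.`);
    }
  }
  for (const name of Object.keys(newFields)) {
    if (!(name in oldFields) && !newFields[name].nullable) {
      violations.push(`New field "${name}" must be nullable.`);
    }
  }
  return violations;
}

// Passing change: the old field is kept but deprecated, and the new field is nullable.
const trackV1: TypeFields = {
  title: { type: "String", nullable: false },
  duration: { type: "Int", nullable: false },
};
const trackV2: TypeFields = {
  title: { type: "String", nullable: false },
  duration: { type: "Int", nullable: false, deprecated: true },
  durationMs: { type: "Int", nullable: true },
};
const safeChange = findPolicyViolations(trackV1, trackV2);
```

A check like this could run in the merge request pipeline so that a removed field or a non-nullable addition is flagged before schema review.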

4. How does FireFly’s annual roadmap planning accommodate clients who follow quarterly planning cycles?

While we maintain an annual roadmap, we have designed our planning process to be both strategic and flexible. Our annual plan is based on long-term visibility into customer requirements (collected through the customer outreach mechanism held in Jan’25) and organizational goals, allowing us to focus resources on core platform improvements that deliver significant value. However, we recognize the dynamic nature of our clients’ needs. We remain adaptable to priority shifts throughout the year, evaluating high-impact requests case by case and making informed trade-off decisions when necessary. This approach allows us to balance long-term strategic initiatives with the ability to respond to emerging client needs, ensuring FireFly continues to evolve in line with both overarching organizational goals and specific client priorities.

5. What is FireFly’s new away team engagement model?

FireFly employs an away team engagement model that balances client self-service with graph quality. It is inspired by the established Amazon Stores away team engagement model, with slight variations as documented here. For minor changes to the graph (for example, adding new enums to existing attributes or renaming an attribute), we provide comprehensive documentation and tools. For complex integrations, a client developer joins our core team temporarily. This away team member attends regular standups, works closely with our experts, and completes GraphQL training to build their skills. They contribute to improving the overall graph while gaining deep insight into our framework. This approach ensures clients can efficiently meet their needs, maintains graph quality, and fosters a collaborative environment that benefits all users.

6. Why can’t all client requests be handled through the away team model without involvement from the core platform team?

While we strive to empower our clients, maintaining the quality, efficiency, and scalability of our GraphQL framework is paramount. All schema changes undergo a thorough review by our core team to ensure architectural integrity and avoid inefficiencies that could impact all users. Due to limited bandwidth (each schema change requires, on average, ~2 weeks of effort from the core team for schema design review, MR review, etc.), we prioritize a select number of client requests each quarter. However, we remain open to discussing high-priority requests with strong business justification. These cases can be evaluated individually, involving leadership when necessary. This approach allows us to balance client needs with the overall health and evolution of our GraphQL ecosystem, ensuring it continues to benefit all customers effectively.

7. How is FireFly supporting Project GreenHornet?

FireFly is supporting the GreenHornet initiative in two major ways. First, we are addressing feature gaps identified in the implementation, adding data so that clients can rely exclusively on FireFly for all experiences. Second, we are working closely with the GreenHornet client team to improve the performance and developer experience of the FireFly service, focusing on improvements to latency, error messaging, and capability/benefit vending.

8. What are some of the key initiatives which are ‘unfunded’ and BTL (Below the line) currently?

Due to resourcing constraints and available bandwidth, we are unable to fund the key capabilities below or the away team support needed by our clients.

Unfunded FireFly core capabilities

  • Support Cache Directives

  • Supporting Cost Directive

  • Project Oracle (support P13N APIs in FireFly)

  • ‘Defer’ functionality support

  • FireFly Official SDK

Unfunded Away team support

  • Maestro Explainability feature

  • Maestro playlist refinements support

  • Podcast Bites support

  • Quattro support

  • X-Ray metrics support for FactRanker

  • User state data support on Podcast entity

  • Podcast Bites entity support

  • Quattro core CX support

  • Library: Breakout Playlist Widget for Free/Prime

  • FireFly Onboarding for MIAFS (IAFT)

11. What are the key discussion topics / leadership asks?

Below are a few of the topics on which we need leadership guidance:

  1. Funding Ask: Several critical initiatives (refer to FAQ 8 above), including core FireFly capabilities (Defer, Official SDK, cache directives, etc.) and stakeholder support through away team engagements (for example, Maestro use cases, 3P and Quattro support, and Podcast support), are currently ‘below the line’ on our roadmap due to capacity constraints. We’re requesting funding for two additional headcount for the FireFly team to address these gaps. This investment would accelerate feature development, enhance stakeholder support, and improve our overall time-to-market. It would allow us to move previously unfunded initiatives into active development while maintaining progress on current priorities.

  2. Audible API access: We aim to integrate Audiobook playback support into FireFly, a strategic initiative that would significantly benefit various customer experience use cases across our 1P, 2P and 3P clients. However, we’ve encountered challenges in obtaining access to the necessary Audiobook playback, recommendation and borrow specific APIs, despite reaching out to the Audible team on several occasions. Given the strategic importance of this capability and its potential to deliver substantial value to our stakeholders, we’re seeking leadership intervention to unblock this initiative.

  3. OTel (OpenTelemetry) adoption: Our Q4’24 developer experience survey and project ‘Heron’ have highlighted stability and performance as critical pain points. While FireFly is investing in improving end-to-end latency observability for server-side operations, we believe significant benefits can be unlocked by extending the OpenTelemetry (OTel) implementation to the client side. This would provide holistic performance insights, enable proactive issue resolution, and ultimately improve customer experience. However, client-side implementation remains unclear and unfunded. We urge leadership to prioritize and fund full-stack OTel adoption (or other available options such as the Bugsnag performance SDK), including client-side instrumentation, to achieve end-to-end observability. This investment will not only address current performance challenges but also future-proof our architecture, facilitating data-driven decision-making and more efficient problem resolution across our entire application stack. We are working closely with the Region Flex central team to understand the scope of end-to-end telemetry support they are proposing for AM.

Appendices

Appendix B: FireFly - Pain points and Areas of Improvement (n=50)

Pain Points

  • Away Team Contributors - Lack of clear documentation around FireFly concepts, Tenets, Guidelines, Best Practices and Engagement model (52%)

  • Away Team Contributors - Code reviews take a long time (52%)

  • Away Team Contributors - Unable to launch short-term, experimental CX without going through the entire process of reviews, schema design, and implementation (50%)

  • Away Team Contributors - Merging my approved branch/MR takes a long time (38%)

  • Consumers - Telemetry and service level visibility (34%)

  • Consumers - FireFly Platform stability (26%)

  • Data Providers - Lack of controls to hold specific CXs accountable for service side issues (20%)

  • Away Team Contributors - Flaky Integration tests make the pipeline unreliable and unstable (16%)

  • Data Providers - Lack of controls to prevent service level abuse through fanout or out of control dial-ups (12%)

  • Data Providers - Individual service level rate limit protection (12%)

Areas of Improvement

  • Consumers - Enhanced observability and easy E2E trace profiling (60%)

  • Consumers - Unified documentation, Dev Console and Schema exploration feature (50%)

  • Away Team Contributors - Low barrier schema experimentation (44%)

  • Consumers - Support for API specific Debuggers (34%)

  • Consumers - Consistent error reporting and handling (22%)

  • Consumers - GraphQL Realtime Subscriptions (20%)

  • Consumers - Consistent Pagination support (18%)

  • Consumers - FireFly SDK (16%)

  • Consumers - Defer & Stream functionality (14%)

  • Away Team Contributors - FireFly Image schema enhancement to include additional attributes (10%)

Appendix C: Roadmap Insights

Roadmap Insights 1 Roadmap Insights 2 Roadmap Insights 3

Appendix D: Important Artifacts

Firefly Prioritization Framework and Engagement Model
DevEx Q4’24 Survey Results & Analysis
FireFly OP1 2025
Firefly roadmap 2025
Notes on Gen AI / Dev Productivity
Music Benefits Platform Evolution
Firefly Queries end of 2024
FireFly capabilities developed in 2024
Amazon Music’s Use of OpenTelemetry for DMA and Beyond