Discover the Spring AI Audio Analysis cookbook for building a powerful, multimodal LLM‑driven audio analysis API. Transcribe, summarize, and analyze audio from files, URLs, or Base64 with our step-by-step tutorial.
1. Introduction
Imagine building an application that can listen to a meeting and provide a perfect transcript, summarize a podcast episode automatically, or even analyze customer service calls for sentiment. That’s not science fiction; it’s the power of AI-driven audio analysis. Today, we’re going to build a comprehensive service that can process audio from multiple sources and provide intelligent insights using modern multimodal AI models.
2. Understanding Multimodality: Teaching AI to Listen
Before we write a single line of code, let’s understand the technology that makes this possible: multimodality. Just as we explored how AI can ‘see’ in our previous Complete Guide to Building a Multimodal AI Image Analysis API with Java Spring Boot, today we’ll teach our application how to ‘hear’.
When we think about audio processing, transcription often comes to mind first. But modern multimodal AI models can do so much more. They can:
- Transcribe speech with high accuracy across multiple languages
- Analyze sentiment and emotion in the speaker’s voice
- Identify multiple speakers and separate their contributions
- Extract key topics and themes from conversations
- Summarize lengthy audio content into actionable insights
- Answer specific questions about the audio content
- Detect background sounds and context beyond just speech
Just like our image analysis service, we’re not building a simple transcription tool – we’re creating an intelligent audio analysis platform that understands context and can answer sophisticated questions about audio content.
3. How Our Multimodal Spring AI Audio Analysis Works
Let’s visualize how our audio analysis system processes different input sources and delivers intelligent insights:
The audio multimodal magic follows this intelligent flow:
- Input Collection: We accept both text prompts and audio from various sources (files, URLs, Base64, or classpath resources)
- Prompt Building: We combine the user’s specific question with the audio media objects into a unified prompt
- Spring AI Processing: The ChatClient handles seamless communication between our application and the multimodal LLM
- AI Analysis: Gemini 2.0 Flash processes both the audio content and text query simultaneously to understand context and extract insights
- Intelligent Response: The AI returns comprehensive analysis that considers both the audio content and the user’s specific requirements
The key insight here is that we’re not just sending audio to the AI – we’re sending contextual requests. For example:
- Text: “Summarize the main points from this customer service call and identify any areas of concern”
- Audio: [customer_call.mp3]
- AI Response: “This 5-minute customer service call discussed a billing issue. Main points: 1) Customer charged twice for subscription, 2) Issue occurred due to system glitch, 3) Full refund promised within 3-5 days. Areas of concern: Customer expressed frustration about wait time, and this is their second call about the same issue.”
The power lies in combining the “what to do” (the text prompt) with the “what to do it on” (the audio).
4. The Elegance of Audio Processing: Core Implementation
Just like our image analysis implementation, audio processing with Spring AI is remarkably clean and intuitive. The entire process boils down to the same three essential steps:
- Convert any audio to a Spring AI Media object – regardless of source (file, URL, Base64, classpath)
- Create a user prompt combining text and media – Spring AI handles the multimodal complexity
- Call the LLM and extract the response – the AI model processes both audio and text together
Here’s the core code that makes it all happen:
// Step 1: Convert your audio into a Spring AI Media object.
//
// The Media constructor requires two arguments:
// 1. A MimeType (e.g., MimeType.valueOf("audio/mp3") for an MP3 file)
// 2. A Resource, which can be created from various sources:
// – ClassPathResource for audio files in your `src/main/resources` folder
// – UrlResource for audio files at external URLs (like a podcast episode)
// – multipartFile.getResource() for user-uploaded audio files
// – ByteArrayResource for audio data from a decoded Base64 string
//
// Below is an example loading an MP3 file from the classpath:
Media audioMedia = new Media(MimeType.valueOf("audio/mp3"), new ClassPathResource("audio/sample.mp3"));
// Steps 2 & 3: Build prompt, call LLM, get response
return chatClient.prompt()
.user(userSpec -> userSpec
.text("Please transcribe the following audio.") // Text instruction
.media(audioMedia)) // Audio input
.call()
.content();
Core Logic

The beauty of consistency: this pattern works identically for any audio source:
- Classpath: new Media(mimeType, new ClassPathResource("audio.mp3"))
- File Upload: new Media(mimeType, multipartFile.getResource())
- URL: new Media(mimeType, new UrlResource("https://example.com/audio.mp3"))
- Base64: new Media(mimeType, new ByteArrayResource(decodedBytes))
Whether you’re analyzing a podcast, customer service call, meeting recording, or voice memo, the code structure remains the same – only the source changes.
5. Real-World Application: Building an Audio Analysis API
We will now build a flexible Audio Analysis API. Such a service could be used for:
- Customer Service Analytics – Analyzing call recordings for quality assurance and sentiment analysis
- Meeting Intelligence – Extracting action items, decisions, and key discussion points from recordings
- Podcast Content Management – Generating summaries, chapters, and searchable transcripts
- Voice Assistant Development – Understanding user intent and context from voice commands
- Compliance Monitoring – Ensuring conversations meet regulatory requirements
- Accessibility Services – Generating captions for audio content.
Our API will handle four different audio input formats:
- Classpath resources – Audio files included within the application
- File uploads – Audio files uploaded directly by a user.
- Web URLs – Audio files hosted on the internet (podcasts, streaming services)
- Base64 strings – Encoded audio data sent within a JSON payload.
For any of these inputs, the user will also provide a text prompt (a question or instruction), and our service will return the AI’s analysis.
⚙️ Project Structure & Setup
Below is the folder structure of our Spring Boot application:
spring-ai-audio-analysis-cookbook
├── src
│ └── main
│ ├── java
│ │ └── com
│ │ └──bootcamptoprod
│ │ ├── controller
│ │ │ └── AudioAnalysisController.java
│ │ ├── service
│ │ │ └── AudioAnalysisService.java
│ │ ├── dto
│ │ │ └── AudioAnalysisRequest.java
│ │ │ └── Base64Audio.java
│ │ │ └── Base64AudioAnalysisRequest.java
│ │ │ └── AudioAnalysisResponse.java
│ │ ├── exception
│ │ │ └── AudioProcessingException.java
│ │ ├── SpringAIAudioAnalysisCookbookApplication.java
│ └── resources
│ └── application.yml
│ └── audio
│ └── sample.mp3
└── pom.xml
Project Structure

Understanding the Project Structure
Here is a quick breakdown of the key files in our project and what each one does:
- SpringAIAudioAnalysisCookbookApplication: The main entry point that bootstraps and starts the entire Spring Boot web application.
- AudioAnalysisController: Exposes the four REST API endpoints (/from-classpath, /from-files, /from-urls, /from-base64) to handle incoming audio analysis requests.
- AudioAnalysisService.java: Contains the core business logic, converting various audio inputs into Media objects and using the ChatClient to communicate with the AI model for tasks like transcription or summarization.
- AudioAnalysisRequest.java, Base64Audio.java, Base64AudioAnalysisRequest.java, AudioAnalysisResponse.java: Simple record classes (DTOs) that define the JSON structure for our API’s requests and responses.
- AudioProcessingException.java: A custom exception used for handling specific errors like invalid prompts or failed audio downloads, resulting in clear HTTP 400 responses.
- application.yml: Configures the application, including the crucial connection details for the AI model provider (API key, model name, etc.).
- audio/*.mp3: Example audio files stored in the application’s resources, ready to be used by the /from-classpath endpoint.
- pom.xml: Declares all the necessary Maven dependencies for Spring Web, the Spring AI OpenAI starter, and other required libraries.
Let’s set up our project with the necessary dependencies and configurations.
Step 1: Add Maven Dependencies
Add the following dependencies to the pom.xml file.
<properties>
<java.version>21</java.version>
<spring-ai.version>1.0.0</spring-ai.version>
</properties>
<dependencies>
<!-- Spring Boot Web for building RESTful web services -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- OpenAI Model Support – configurable for various AI providers (e.g., OpenAI, Google Gemini) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<!-- For Logging Spring AI calls -->
<dependency>
<groupId>org.zalando</groupId>
<artifactId>logbook-spring-boot-starter</artifactId>
<version>3.12.2</version>
</dependency>
</dependencies>
<dependencyManagement>
<dependencies>
<!-- Spring AI bill of materials to align all spring-ai versions -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>${spring-ai.version}</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
pom.xml

In this configuration:
- spring-boot-starter-web: To create our RESTful API endpoints.
- spring-ai-starter-model-openai: The Spring AI starter for OpenAI-compatible models. We’ll use it with Google Gemini.
- logbook-spring-boot-starter: A handy library for logging HTTP requests and responses. It’s great for debugging.
- spring-ai-bom: The Spring AI Bill of Materials (BOM), located in the <dependencyManagement> section, simplifies our setup. It manages the versions of all Spring AI modules, ensuring they are compatible and preventing potential conflicts with library versions.
Step 2: Configure Application Properties
Next, we configure our application to connect to the AI model.
spring:
application:
name: spring-ai-audio-analysis-cookbook
# AI configurations
ai:
openai:
api-key: ${GEMINI_API_KEY}
base-url: https://generativelanguage.googleapis.com/v1beta/openai
chat:
completions-path: /chat/completions
options:
model: gemini-2.0-flash-exp
# (Optional) For detailed request/response logging
logging:
level:
org.zalando.logbook.Logbook: TRACE
application.yml

📄 Configuration Overview
- We’re using Google’s Gemini model through OpenAI-compatible API endpoints
- The API key is externalized using environment variables for security
- Logging is set to TRACE level to see HTTP traffic going to Google Gemini
Step 3: Application Entry Point
Now, let’s define the main class that boots our Spring Boot app.
package com.bootcamptoprod;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.client.RestClientCustomizer;
import org.springframework.context.annotation.Bean;
import org.zalando.logbook.Logbook;
import org.zalando.logbook.spring.LogbookClientHttpRequestInterceptor;
@SpringBootApplication
public class SpringAIAudioAnalysisCookbookApplication {
public static void main(String[] args) {
SpringApplication.run(SpringAIAudioAnalysisCookbookApplication.class, args);
}
/**
* Configures a RestClientCustomizer bean to integrate Logbook for HTTP logging.
* This bean adds an interceptor to all outgoing REST client calls made by Spring,
* allowing us to log the requests sent to the AI model and the responses received.
*/
@Bean
public RestClientCustomizer restClientCustomizer(Logbook logbook) {
return restClientBuilder -> restClientBuilder.requestInterceptor(new LogbookClientHttpRequestInterceptor(logbook));
}
}
SpringAIAudioAnalysisCookbookApplication.java

Explanation:
- Main Class to Run the Application: SpringAIAudioAnalysisCookbookApplication is the starting point of our application. When you run this class, Spring Boot initializes all components and starts the embedded server.
- HTTP Logging: The RestClientCustomizer bean adds HTTP logging to all outgoing REST client calls, helping us debug AI model interactions.
Step 4: Create Data Transfer Objects (DTOs)
Before diving into the service logic, let’s understand our data contracts. These simple Java record classes define the structure of the JSON data our API will send and receive.
/**
* Represents a single audio file encoded as a Base64 string, including its MIME type.
*/
public record Base64Audio(
String mimeType,
String data // The Base64 encoded audio string
) {}
/**
* Defines the API request body for analyzing one or more Base64 encoded audio files.
*/
public record Base64AudioAnalysisRequest(
List<Base64Audio> base64AudioList,
String prompt
) {}
/**
* Defines the API request body for analyzing audio from URLs or a single classpath file.
*/
public record AudioAnalysisRequest(
List<String> audioUrls,
String prompt,
String fileName
) {}
/**
* Represents the final text response from the AI model, sent back to the client.
* This DTO is used for all successful API responses.
*/
public record AudioAnalysisResponse(
String response
) {}
DTOs

Explanation:
- Base64Audio: Represents a single audio file provided as a Base64 encoded string along with its MIME type.
- Base64AudioAnalysisRequest: Defines the request payload for analyzing a list of Base64 encoded audio files with a single text prompt.
- AudioAnalysisRequest: Defines the request payload for analyzing audio files from a list of URLs or a single file from the classpath, along with a text prompt.
- AudioAnalysisResponse: A simple record that wraps the final text analysis (e.g., a transcription or summary) received from the AI model for all API responses.
Step 5: The Controller Layer: API Endpoints
Our controller exposes four endpoints, each handling a different audio input method:
package com.bootcamptoprod.controller;
import com.bootcamptoprod.dto.AudioAnalysisRequest;
import com.bootcamptoprod.dto.AudioAnalysisResponse;
import com.bootcamptoprod.dto.Base64AudioAnalysisRequest;
import com.bootcamptoprod.exception.AudioProcessingException;
import com.bootcamptoprod.service.AudioAnalysisService;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;
import java.util.List;
@RestController
@RequestMapping("/api/v1/audio/analysis")
public class AudioAnalysisController {
private final AudioAnalysisService audioAnalysisService;
// Constructor Injection
public AudioAnalysisController(AudioAnalysisService audioAnalysisService) {
this.audioAnalysisService = audioAnalysisService;
}
/**
* SCENARIO 1: Analyze a single audio file from the classpath (e.g., src/main/resources/audio).
*/
@PostMapping("/from-classpath")
public ResponseEntity<AudioAnalysisResponse> analyzeFromClasspath(@RequestBody AudioAnalysisRequest request) {
AudioAnalysisResponse response = audioAnalysisService.analyzeAudioFromClasspath(request.fileName(), request.prompt());
return ResponseEntity.ok(response);
}
/**
* SCENARIO 2: Analyze one or more audio files uploaded by the user.
* This endpoint handles multipart/form-data requests.
*/
@PostMapping(value = "/from-files", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public ResponseEntity<AudioAnalysisResponse> analyzeFromFiles(@RequestParam("audioFiles") List<MultipartFile> audioFiles, @RequestParam("prompt") String prompt) {
AudioAnalysisResponse response = audioAnalysisService.analyzeAudioFromFile(audioFiles, prompt);
return ResponseEntity.ok(response);
}
/**
* SCENARIO 3: Analyze one or more audio files from a list of URLs.
*/
@PostMapping("/from-urls")
public ResponseEntity<AudioAnalysisResponse> analyzeFromUrls(@RequestBody AudioAnalysisRequest request) {
AudioAnalysisResponse response = audioAnalysisService.analyzeAudioFromUrl(request.audioUrls(), request.prompt());
return ResponseEntity.ok(response);
}
/**
* SCENARIO 4: Analyze one or more audio files from Base64-encoded strings.
*/
@PostMapping("/from-base64")
public ResponseEntity<AudioAnalysisResponse> analyzeFromBase64(@RequestBody Base64AudioAnalysisRequest request) {
AudioAnalysisResponse response = audioAnalysisService.analyzeAudioFromBase64(request.base64AudioList(), request.prompt());
return ResponseEntity.ok(response);
}
/**
* Centralized exception handler for this controller.
* Catches our custom exception from the service layer and returns a clean
* HTTP 400 Bad Request with the error message.
*/
@ExceptionHandler(AudioProcessingException.class)
public ResponseEntity<AudioAnalysisResponse> handleAudioProcessingException(AudioProcessingException ex) {
return ResponseEntity.badRequest().body(new AudioAnalysisResponse(ex.getMessage()));
}
}
AudioAnalysisController.java

The AudioAnalysisController acts as the entry point for all incoming web requests. It defines the specific URLs (endpoints) for our four different audio analysis scenarios and delegates the heavy lifting to the AudioAnalysisService.
Here is a breakdown of its responsibilities:
- /from-classpath: Accepts a JSON request to analyze a single audio file located within the application’s resources folder.
- /from-files: Handles multipart/form-data requests, allowing users to upload one or more audio files for analysis along with a prompt.
- /from-urls: Processes a JSON request containing a list of public audio URLs, downloading and analyzing each one against the user’s prompt.
- /from-base64: Accepts a JSON payload with a list of Base64-encoded audio strings, making it easy to send audio data directly in the request body.
- @ExceptionHandler: Acts as a centralized error gateway, catching our custom AudioProcessingException and returning a user-friendly HTTP 400 Bad Request with a clear error message.
Step 6: Custom Exception Handling
To handle predictable errors like invalid URLs or empty prompts gracefully, we use a dedicated custom exception.
package com.bootcamptoprod.exception;
import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.ResponseStatus;
// Custom exception that will result in an HTTP 400 Bad Request response
@ResponseStatus(HttpStatus.BAD_REQUEST)
public class AudioProcessingException extends RuntimeException {
public AudioProcessingException(String message) {
super(message);
}
public AudioProcessingException(String message, Throwable cause) {
super(message, cause);
}
}
AudioProcessingException.java

Explanation:
This simple class is a powerful tool for creating clean and predictable REST APIs.
- This class defines a custom, unchecked exception, which allows us to handle predictable errors without cluttering our service-layer methods with throws declarations.
- The crucial element is the @ResponseStatus(HttpStatus.BAD_REQUEST) annotation. This powerful Spring feature instructs the framework to automatically convert this exception into an HTTP 400 Bad Request response whenever it’s thrown. This makes it perfect for handling client-side problems, such as an invalid audio URL, an empty file upload, or a malformed Base64 string.
- In our AudioAnalysisController, we also have a specific @ExceptionHandler for this type. This gives us the best of both worlds: @ResponseStatus provides a sensible default, while our handler allows us to customize the exact JSON response body, ensuring the user gets a clear and helpful error message.
Step 7: The Heart of the Application: Service Implementation
Now let’s explore the core service that handles all the audio processing logic. This is where the multimodal AI magic happens, converting spoken words from various sources into accurate transcriptions and intelligent summaries.
@Service
public class AudioAnalysisService {
private static final Logger log = LoggerFactory.getLogger(AudioAnalysisService.class);
// A single, reusable system prompt that defines the AI's persona and rules for audio.
private static final String SYSTEM_PROMPT_TEMPLATE = getSystemPrompt();
// A constant to programmatically check if the AI followed our rules.
private static final String AI_ERROR_RESPONSE = "Error: I can only analyze audio and answer related questions.";
private final ChatClient chatClient;
// The ChatClient.Builder is injected by Spring, allowing us to build the client.
public AudioAnalysisService(ChatClient.Builder chatClientBuilder) {
this.chatClient = chatClientBuilder.build();
}
/**
* System prompt that defines the AI's behavior and boundaries for audio tasks.
*/
private static String getSystemPrompt() {
return """
You are an AI assistant that specializes in audio analysis.
Your task is to analyze the provided audio file(s) and answer the user's question.
Common tasks are transcribing speech to text or summarizing the content.
If the user's prompt is not related to analyzing the audio,
respond with the exact phrase: 'Error: I can only analyze audio and answer related questions.'
""";
}
}
AudioAnalysisService.java

Explanation:
- ChatClient: This is the central interface from the Spring AI framework, serving as our main gateway to the Large Language Model (LLM). We use Spring’s dependency injection to receive a ChatClient.Builder in the constructor, which we then use to build and configure our client instance.
- SYSTEM_PROMPT_TEMPLATE: This constant represents a powerful “meta-instruction” that sets the stage for every interaction with the AI. A system prompt establishes the rules of engagement and defines the AI’s specific persona. In our case, it instructs the AI on two crucial points:
  - Its role is to be an “audio analysis expert,” specializing in tasks like transcription and summarization.
  - It must use a precise error message if the user’s request is unrelated to the provided audio. This acts as a vital “guardrail,” ensuring the AI’s responses remain relevant and predictable.
- AI_ERROR_RESPONSE: We define the exact error phrase from our system prompt as a static final string. This is not just for reference; it allows our code to programmatically validate the AI’s output. By checking if the response matches this constant, we can reliably detect when a user’s prompt was out of scope and handle it as a client-side error.
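Each public method below also begins with a call to validatePrompt, a small helper the snippets don’t show. Based on how it’s used, a minimal sketch of it (our assumption, not the original code) could look like this:

// Hypothetical sketch of the validatePrompt helper referenced by the
// public service methods; it rejects blank prompts with an HTTP 400.
private void validatePrompt(String prompt) {
    if (!StringUtils.hasText(prompt)) {
        throw new AudioProcessingException("Prompt cannot be empty.");
    }
}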
Step 7.1: Core Analysis Method: The Multimodal AI Communication Hub
All roads lead to this private method. It doesn’t matter how the audio is provided (as a file, a URL, etc.)—it eventually gets passed to performAnalysis
. This method is the engine room of our service, responsible for actually talking to the AI.
/**
* This is the CORE method that communicates with the AI.
* It is called by all the public service methods.
*/
private AudioAnalysisResponse performAnalysis(String prompt, List<Media> mediaList) {
if (mediaList.isEmpty()) {
throw new AudioProcessingException("No valid audio files were provided for analysis.");
}
// This is where the magic happens: combining text and media in one call.
String response = chatClient.prompt()
.system(SYSTEM_PROMPT_TEMPLATE)
.user(userSpec -> userSpec
.text(prompt) // The user's text instruction
.media(mediaList.toArray(new Media[0]))) // The list of audio files
.call()
.content();
// Check if the AI responded with our predefined error message (a "guardrail").
if (AI_ERROR_RESPONSE.equalsIgnoreCase(response)) {
throw new AudioProcessingException("The provided prompt is not related to audio analysis.");
}
return new AudioAnalysisResponse(response);
}
AudioAnalysisService.java

Understanding the multimodal communication flow:
- System Instructions (.system(SYSTEM_PROMPT_TEMPLATE)): This sets the ground rules and tells the AI what its job is.
- User Specification: This combines both the text instruction and the audio data:
  - .text(prompt): The user’s question or command (e.g., “Transcribe this meeting”).
  - .media(mediaList.toArray(new Media[0])): The audio file(s) we want the AI to analyze.
- AI Processing: The model “listens” to the audio and “reads” the text instruction at the same time to understand the request.
- Response Validation: Our code checks if the AI followed our rules and didn’t give an off-topic answer.
- Structured Return: The final text from the AI is wrapped in our AudioAnalysisResponse object.
The beauty of this design is its simplicity. No matter how the audio arrives, it’s always converted into a standard Media
object, making our AI communication code clean and uniform.
Step 7.2: Scenario 1: Analyzing Audio from the Classpath (Resources Folder)
This scenario is perfect for analyzing audio files that are packaged with your application—great for demos, testing, or default sound clips.
public AudioAnalysisResponse analyzeAudioFromClasspath(String fileName, String prompt) {
validatePrompt(prompt);
if (!StringUtils.hasText(fileName)) {
throw new AudioProcessingException("File name cannot be empty.");
}
// Assumes audio files are located in `src/main/resources/audio/`
Resource audioResource = new ClassPathResource("audio/" + fileName);
if (!audioResource.exists()) {
throw new AudioProcessingException("File not found in classpath: audio/" + fileName);
}
// We assume MP3 for this example, but you could determine this dynamically.
Media audioMedia = new Media(MimeType.valueOf("audio/mp3"), audioResource);
return performAnalysis(prompt, List.of(audioMedia));
}
AudioAnalysisService.java

Key Details:
- Validate Input: It first ensures the prompt and fileName are not empty.
- Find Resource: It uses Spring’s ClassPathResource to locate the audio file within the src/main/resources/audio/ directory of your project.
- Check Existence: It calls .exists() to make sure the file was actually found, throwing an error if not.
- Create Media: It wraps the Resource in a Spring AI Media object, specifying its MIME type.
- Perform Analysis: It calls the central performAnalysis method, wrapping the single Media object in a List.
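The snippet hard-codes audio/mp3. If you’d rather resolve the MIME type from the file extension, a small helper along these lines would work. This is a sketch with an illustrative extension list, so confirm which formats your chosen model actually accepts:

// Hypothetical helper: derive the MimeType from the file name's extension,
// falling back to audio/mp3 just like the classpath example above.
private MimeType mimeTypeFromFileName(String fileName) {
    String lower = fileName.toLowerCase();
    if (lower.endsWith(".wav")) return MimeType.valueOf("audio/wav");
    if (lower.endsWith(".ogg")) return MimeType.valueOf("audio/ogg");
    if (lower.endsWith(".flac")) return MimeType.valueOf("audio/flac");
    return MimeType.valueOf("audio/mp3");
}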
Step 7.3: Scenario 2: Analyzing Uploaded Audio Files (Multipart)
This is the most common use case for a web application, letting users upload audio files directly from their computer or phone.
public AudioAnalysisResponse analyzeAudioFromFile(List<MultipartFile> audios, String prompt) {
validatePrompt(prompt);
if (audios == null || audios.isEmpty() || audios.stream().allMatch(MultipartFile::isEmpty)) {
throw new AudioProcessingException("Audio files list cannot be empty.");
}
List<Media> mediaList = audios.stream()
.filter(file -> !file.isEmpty())
.map(this::convertMultipartFileToMedia) // Convert each file to a Media object
.collect(Collectors.toList());
return performAnalysis(prompt, mediaList);
}
/**
* Helper method for converting an uploaded MultipartFile into a Spring AI Media object.
*/
private Media convertMultipartFileToMedia(MultipartFile file) {
String contentType = file.getContentType();
MimeType mimeType = determineAudioMimeType(contentType);
return new Media(mimeType, file.getResource());
}
/**
* Helper method to determine MimeType from a content type string for common audio formats.
*/
private MimeType determineAudioMimeType(String contentType) {
if (contentType == null) {
return MimeType.valueOf("audio/mp3"); // Default fallback
}
return switch (contentType.toLowerCase()) {
case "audio/wav", "audio/x-wav" -> MimeType.valueOf("audio/wav");
default -> MimeType.valueOf("audio/mp3");
};
}
AudioAnalysisService.java

Key Details:
- Process List: The method takes a list of uploaded files (List<MultipartFile>) and uses a Java Stream to handle them efficiently, ignoring any that might be empty.
- Convert Each File to a Media Object: For each valid file, our code performs a crucial three-step conversion using the convertMultipartFileToMedia helper:
  - Get Content Type: First, it identifies the audio format by checking the file’s content type (e.g., “audio/wav”).
  - Determine MIME Type: It then passes this text to our determineAudioMimeType helper. This small utility translates the plain text into a proper MimeType object, which is the format Spring AI officially recognizes.
  - Create the Media Object: Finally, it packages the audio file’s data (file.getResource()) and its verified MimeType into a new Media object, ready to be sent to the AI.
- Collect and Analyze: The stream gathers all the newly created Media objects into a list and sends it off to the main performAnalysis method to be processed by the AI.
Step 7.4: Scenario 3: Analyzing Audio from Web URLs
This is perfect for analyzing audio that’s already online, like a podcast episode, without making the user download and re-upload it.
public AudioAnalysisResponse analyzeAudioFromUrl(List<String> audioUrls, String prompt) {
validatePrompt(prompt);
if (audioUrls == null || audioUrls.isEmpty()) {
throw new AudioProcessingException("Audio URL list cannot be empty.");
}
List<Media> mediaList = audioUrls.stream()
.map(this::convertUrlToMedia)
.collect(Collectors.toList());
return performAnalysis(prompt, mediaList);
}
/**
* Helper method for processing an audio file from a URL and converting it into a Media object.
*/
private Media convertUrlToMedia(String audioUrl) {
try {
log.info("Processing audio from URL: {}", audioUrl);
// new URL(String) is deprecated since Java 20; build the URL via URI instead.
URL url = URI.create(audioUrl).toURL();
// Set timeouts to prevent the application from hanging on slow network requests.
URLConnection connection = url.openConnection();
connection.setConnectTimeout(5000); // 5-second timeout
connection.setReadTimeout(5000); // 5-second timeout
// Get the MIME type from the URL's response headers to validate it's an audio file.
String contentType = connection.getContentType();
if (contentType == null || !contentType.startsWith("audio/")) {
throw new AudioProcessingException("Invalid or non-audio MIME type for URL: " + audioUrl);
}
Resource audioResource = new UrlResource(audioUrl);
return new Media(MimeType.valueOf(contentType), audioResource);
} catch (Exception e) {
throw new AudioProcessingException("Failed to download or process audio from URL: " + audioUrl, e);
}
}
AudioAnalysisService.java

Key Details:
- Process URLs: The public method streams the list of URL strings and maps each one to the convertUrlToMedia helper.
- Establish Connection: The helper method creates a URLConnection to the audio URL with connect and read timeouts. This prevents our app from getting stuck if the remote server is slow.
- Get MIME Type: It checks the Content-Type header from the server’s response. This is a great way to confirm the URL points to an audio file before downloading it.
- Create Resource: It creates a UrlResource, a special Spring Resource that represents data at a URL. Spring AI handles the lazy-loading of the data when it’s needed.
- Error Handling: The entire block is wrapped in a try-catch to handle exceptions (e.g., network errors, 404 Not Found), converting them into our user-friendly AudioProcessingException.
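One subtlety worth knowing: the timeouts above only govern the connection used for the Content-Type check, because UrlResource opens its own, separate connection when Spring AI later reads the data. If you want the timeouts to cover the actual download as well, a variation (a sketch, not the original implementation) reads the bytes through the timed connection and hands Spring AI an in-memory resource instead:

// Sketch: download through the already-configured connection so the
// timeouts apply to the transfer itself, then wrap the bytes in memory.
try (InputStream in = connection.getInputStream()) {
    byte[] audioBytes = in.readAllBytes();
    return new Media(MimeType.valueOf(contentType), new ByteArrayResource(audioBytes));
}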
Step 7.5: Scenario 4: Analyzing Base64 Encoded Audio
This method is handy for APIs where the audio data is sent directly inside a JSON request, which is a common pattern for mobile apps or web frontends.
public AudioAnalysisResponse analyzeAudioFromBase64(List<Base64Audio> base64Audios, String prompt) {
validatePrompt(prompt);
if (base64Audios == null || base64Audios.isEmpty()) {
throw new AudioProcessingException("Base64 audio list cannot be empty.");
}
List<Media> mediaList = base64Audios.stream()
.map(this::convertBase64ToMedia)
.collect(Collectors.toList());
return performAnalysis(prompt, mediaList);
}
/**
* Helper method for decoding a Base64 string into a Media object.
*/
private Media convertBase64ToMedia(Base64Audio base64Audio) {
if (!StringUtils.hasText(base64Audio.mimeType()) || !StringUtils.hasText(base64Audio.data())) {
throw new AudioProcessingException("Base64 audio data and MIME type cannot be empty.");
}
try {
// Decode the Base64 string back into its original binary format (byte array).
byte[] decodedBytes = Base64.getDecoder().decode(base64Audio.data());
// Wrap the byte array in a resource and create the Media object.
return new Media(MimeType.valueOf(base64Audio.mimeType()), new ByteArrayResource(decodedBytes));
} catch (Exception e) {
throw new AudioProcessingException("Invalid Base64 data provided.", e);
}
}
AudioAnalysisService.java

Key Details:
- Process List: Similar to the other scenarios, this method streams the list of Base64Audio objects and maps them to the convertBase64ToMedia helper.
- Decode Data: The helper’s main job is to convert the long Base64 text string back into its original binary data (a byte[] array).
- Wrap in Resource: It then uses Spring’s ByteArrayResource to wrap this byte array. This is an efficient, in-memory implementation of the Resource interface.
- Create Media: It creates the Media object using the MIME type provided in the request and the ByteArrayResource.
- Error Handling: The decoding step is inside a try-catch block, so if the client sends invalid Base64 data, our app won’t crash and will instead return a clean error.
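On the client side, if you need to produce that Base64 string from Java (say, in an integration test), a quick sketch, assuming a local sample.mp3, looks like this:

// Hypothetical client-side utility (not part of the service): encode a local
// audio file into the Base64 string expected by the /from-base64 endpoint.
public static void main(String[] args) throws java.io.IOException {
    byte[] audioBytes = java.nio.file.Files.readAllBytes(java.nio.file.Path.of("sample.mp3"));
    String base64Data = java.util.Base64.getEncoder().encodeToString(audioBytes);
    System.out.println(base64Data);
}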
6. Testing the Application
Once the application is started, you can test each of the four endpoints using a command-line tool like cURL, or you can also test using Postman. Here are example requests for each scenario.
1. Analyze an Audio File from the Classpath
This endpoint uses an audio file bundled inside the application’s src/main/resources/audio folder.
curl -X POST http://localhost:8080/api/v1/audio/analysis/from-classpath \
-H "Content-Type: application/json" \
-d '{
"fileName": "sample.mp3",
"prompt": "Transcribe this audio file"
}'
cURL

2. Analyze an Uploaded Audio File
This endpoint accepts a standard file upload. Make sure to replace /path/to/your/audiofileX.mp3
with the actual path to audio files on your computer.
curl -X POST 'http://localhost:8080/api/v1/audio/analysis/from-files' \
--form 'prompt=analyse sentiment in these audio files' \
--form 'audioFiles=@/path/to/your/audioFile1.mp3' \
--form 'audioFiles=@/path/to/your/audioFile2.mp3'
cURL

3. Analyze an Audio File from a URL
You can use any publicly accessible audio URL for this endpoint, like a podcast episode.
curl -X POST http://localhost:8080/api/v1/audio/analysis/from-urls \
-H "Content-Type: application/json" \
-d '{
"audioUrls": ["https://some-public-cdn.com/podcast.mp3"],
"prompt": "Transcribe this audio in hindi"
}'
cURL

4. Analyze a Base64-Encoded Audio File
For this request, you need to provide the audio data as a Base64 text string. You’ll need to replace the value of “data” with your own encoded audio.
Tip: On macOS or Linux, you can easily generate a Base64 string from a file and copy it to your clipboard with the command: base64 -i your_audio.mp3 | pbcopy
curl -X POST http://localhost:8080/api/v1/audio/analysis/from-base64 \
-H "Content-Type: application/json" \
-d '{
"base64AudioList": [
{
"mimeType": "audio/mp3",
"data": "SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAA..."
}
],
"prompt": "Generate subtitles from this audio with timestamps every 2 seconds"
}'
cURL
7. Peeking Under the Hood – Logging HTTP Calls to the LLM
Have you ever wondered what your app is actually sending to the AI? It’s incredibly helpful to see the exact request sent to the model and the raw response that comes back. This is the best way to debug a prompt that isn’t giving you the results you expect.
In our project, we made this easy by using a neat library called Logbook. Setting it up only took three simple steps:
- We added the Logbook dependency to our pom.xml file.
- We configured a RestClientCustomizer bean in our main application class, which tells Spring to use Logbook for its web calls.
- We set the logging level to TRACE in our application.yml file to see all the details.
With that simple setup, every conversation our ChatClient has with the AI model is automatically printed to the console for us to see.
For a more in-depth guide on this topic, you can read our detailed article: Spring AI Log Model Requests and Responses – 3 Easy Ways
👉 Sample Request Log
Here is an example of what the log looks like when we send a request. You can clearly see our system instructions, the user’s prompt (“Transcribe this audio file”), and the audio data itself (shortened for brevity) being sent to the model.
2025-07-25T01:16:34.310+05:30 TRACE 2327 --- [spring-ai-audio-analysis-cookbook] [nio-8080-exec-7] org.zalando.logbook.Logbook : {
"origin": "local",
"type": "request",
"correlation": "d1234128b16dec7c",
"protocol": "HTTP/1.1",
"remote": "localhost",
"method": "POST",
"uri": "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
"host": "generativelanguage.googleapis.com",
"path": "/v1beta/openai/chat/completions",
"scheme": "https",
"port": null,
"headers": {
"Authorization": ["XXX"],
"Content-Length": ["436145"],
"Content-Type": ["application/json"]
},
"body": {
"messages": [
{
"content": "You are an AI assistant that specializes in audio analysis.\nYour task is to analyze the provided audio file(s) and answer the user's question.\nCommon tasks are transcribing speech to text or summarizing the content.\nIf the user's prompt is not related to analyzing the audio,\nrespond with the exact phrase: 'Error: I can only analyze audio and answer related questions.'\n",
"role": "system"
},
{
"content": [
{
"type": "text",
"text": "Transcribe this audio file"
},
{
"type": "input_audio",
"input_audio": {
"data": "SUQzBAAAAAAAI1RTU0UAAAAPAAADAAAAAAAASW5mbwAAA....",
"format": "mp3"
}
}
],
"role": "user"
}
],
"model": "gemini-2.0-flash-exp",
"stream": false,
"temperature": 0.7
}
}
Application Logs

👉 Sample Response Log
Here is the corresponding response from the LLM, showing a successful 200 status and the model’s full analysis, which is invaluable for troubleshooting.
2025-07-25T01:16:36.764+05:30 TRACE 2327 --- [spring-ai-audio-analysis-cookbook] [nio-8080-exec-7] org.zalando.logbook.Logbook : {
"origin": "remote",
"type": "response",
"correlation": "d1234128b16dec7c",
"duration": 2476,
"protocol": "HTTP/1.1",
"status": 200,
"headers": {
...
},
"body": {
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "In this guide, we'll build a multimodal AI audio analysis API using Spring AI that can transcribe audio, detect sentiment, identify speakers, and extract meaningful insights from various audio sources.",
"role": "assistant"
}
}
],
"created": 1753386396,
"id": "m42CaJCbNarUnvgP1qSawAo",
"model": "gemini-2.0-flash-exp",
"object": "chat.completion",
"usage": {
"completion_tokens": 38,
"prompt_tokens": 412,
"total_tokens": 450
}
}
}
Application Logs

8. Video Tutorial
If you prefer visual learning, check out our step-by-step video tutorial. It walks you through this versatile Audio Analysis API from scratch, demonstrating how to handle different audio sources and interact with a multimodal LLM using Spring AI.
📺 Watch on YouTube:
9. Source Code
The complete source code for this Spring AI audio analysis project is available on our GitHub. Just clone the repository, plug in your API key, and run it locally. It’s the best way to play around with the code and understand how all the pieces fit together.
🔗 Spring AI Audio Analysis: https://github.com/BootcampToProd/spring-ai-audio-analysis-cookbook
10. Things to Consider
When implementing this audio analysis solution in production environments, consider these critical factors:
- Performance and Scalability: Audio files are significantly larger than images, often ranging from a few MB to several GB for long recordings. Implement streaming processing where possible, use cloud storage for uploaded files, and consider chunking large audio files for processing.
- Cost Management: Audio processing can be expensive due to file sizes and processing time. Monitor API usage carefully, implement caching for frequently analyzed content, and consider preprocessing to reduce file sizes while maintaining quality.
- Security and Privacy: Always validate file sizes and types to prevent malicious uploads (a minimal size guard is sketched after this list). Audio often contains sensitive information: implement encryption for stored files, ensure secure transmission, comply with privacy regulations like GDPR, and provide options for automatic file deletion after processing.
- Error Handling and Resilience: Network timeouts are more common with large audio files. Implement robust retry mechanisms, provide progress indicators for long-running operations, and handle partial failures gracefully in multi-file scenarios.
- Audio Format Support: Different AI models support different audio formats. Validate formats before processing, consider implementing format conversion, and provide clear error messages for unsupported formats.
- Memory Usage Monitoring: Track memory usage when handling large audio files to avoid OOMs.
- Usage Alerts: Monitor and alert on AI API usage to prevent budget overruns.
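As a concrete starting point for the size validation mentioned above, here is a minimal, hypothetical guard for the multipart endpoint. The 20 MB limit and the helper name are illustrative assumptions; check your model provider’s real limits, and note that Spring Boot’s spring.servlet.multipart.max-file-size property can enforce a hard cap as well:

// Hypothetical guard, not part of the original service: reject oversized
// uploads before any (potentially expensive) AI call is made.
private static final long MAX_AUDIO_BYTES = 20L * 1024 * 1024; // assumed 20 MB limit

private void validateSize(MultipartFile file) {
    if (file.getSize() > MAX_AUDIO_BYTES) {
        throw new AudioProcessingException(
                "Audio file too large (max 20 MB): " + file.getOriginalFilename());
    }
}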
11. FAQs
Can I analyze live audio streams?
The current implementation handles static audio files. For live streams, you’d need to capture audio chunks and analyze them sequentially.
How accurate is the transcription compared to specialized services?
Modern multimodal AI models provide highly accurate transcription, often comparable to dedicated transcription services. However, accuracy can vary based on audio quality, accents, and background noise.
Can I extract timestamps for different parts of the audio?
Yes! You can ask the AI to provide timestamps for key moments, speaker changes, or topic transitions. For example: “Provide a timeline of topics discussed with timestamps.”
Can I use other AI models besides Gemini?
Yes! Spring AI supports multiple providers including OpenAI GPT, Anthropic Claude, and Azure OpenAI. Simply change the configuration in application.yml.
What audio formats are supported?
Common formats like MP3 and WAV are generally supported. However, support can vary depending on the AI model you’re using. It’s a good idea to verify the compatibility with your chosen model. You can also add checks in your application to only accept supported formats.
How long can audio files be?
The maximum audio duration depends on both the AI model’s capabilities and your app’s timeout or file handling settings. To handle this reliably, you can implement validation in your app to restrict file size and duration. For longer recordings, consider breaking them into smaller segments before processing.
How can I process long audio files, like a 1-hour lecture?
You should implement “chunking.” This involves programmatically splitting the large audio file into smaller pieces (e.g., 10-15 minutes each) and sending them to the API one by one, then reassembling the results.
12. Conclusion
In this guide, we built a powerful multimodal AI audio analysis API using Spring AI, demonstrating how modern AI can understand and analyze audio content effectively. The architecture is flexible, supporting different audio input methods like classpath files, system uploads, and Base64-encoded data—making it adaptable to various use cases such as customer support, meeting transcription, or podcast management. This foundation is scalable, reliable, and ready for production, with the added benefit that as AI models improve, your application will continue to get smarter without needing major changes to the code.
13. Learn More
Interested in learning more?
Spring AI Image Analysis: Building Powerful Multimodal LLM APIs