Code Tutorial

import torch
import IPython.display as ipd

# Render five seconds of random noise as a quick check that audio playback works
sr = 44100                                    # sample rate in Hz
duration = 5                                  # clip length in seconds
audio_sample = torch.randn(1, sr * duration)  # shape: (channels, samples)
ipd.Audio(audio_sample.numpy(), rate=sr)

Stable Audio Open Tutorial

Stable Audio Open is fully available through HuggingFace. To run Stable Audio Open locally, you first need to generate an $HF_TOKEN for yourself; see https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication for the exact steps (you will need to register a HuggingFace account first). Once you have generated the token, export it as an environment variable with the following bash command:

export HF_TOKEN="YOUR_HF_TOKEN"

The rest of this tutorial largely follows the demo design of the public Stable Audio Open resources:

First, we need to install the following packages if you have not done so already. Installing Stable-Audio-Tools directly can run into issues, so we recommend creating a dedicated virtual environment (not conda) to run this notebook.
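A minimal way to set up such an environment, assuming python3 with the built-in venv module is available (the environment name stable-audio is our own choice):

# create and activate a dedicated virtual environment (the name is arbitrary)
python3 -m venv stable-audio
source stable-audio/bin/activate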

# !pip install torch torchaudio torchvision stable-audio-tools einops

If you are running locally, you can set HF_TOKEN directly in your local environment (as shown below). If you are using a Colab notebook instead, you need to upload HF_TOKEN as a "secret" to Colab first, in which case the command below will have no effect.

import os
import warnings

os.environ['HF_TOKEN'] = 'Your API key'  # replace with your actual HF token
warnings.filterwarnings('ignore', category=FutureWarning)  # silence noisy FutureWarnings
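To sanity-check that the token is picked up, you can query the HuggingFace Hub for the logged-in account (a quick sketch, assuming huggingface_hub is installed, which stable-audio-tools depends on):

from huggingface_hub import whoami
print(whoami()["name"])  # should print your HuggingFace username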

Next, we can load the model from HuggingFace. Note that stable-audio-tools has some known dependency issues on M1 Macs, so we recommend using a Colab notebook (or a Linux system) to run this tutorial.

import torch
import torchaudio
# import librosa
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import IPython.display as ipd
from functools import partial

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
No module named 'flash_attn'
flash_attn not installed, disabling Flash Attention
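It is worth inspecting what the config reports; for stable-audio-open-1.0 this should be a 44.1 kHz sample rate and a fixed generation window of roughly 47 seconds (a small sketch; the exact numbers come from the downloaded config):

print(f"sample rate: {sample_rate} Hz")
print(f"window size: {sample_size} samples (~{sample_size / sample_rate:.1f} s)")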

First, we wrap the sampling code in a cleaner wrapper function, since a few of its arguments need to be supplied but rarely need tuning.

# this just cleans things up a bit so the code below highlights the important knobs
easy_generate = partial(generate_diffusion_cond, sample_size=sample_size, sigma_min=0.3, sigma_max=500, device=device)

Next, we can define the conditioning information. For the default Stable Audio Open, the conditioning consists of a text prompt, a timing offset, and the total duration.

# Set up text and timing conditioning
conditioning = [{
    "prompt": "clean guitar, sweep picking, 140 bpm, G minor",
    "seconds_start": 0, # "where" in time the sample starts within the song
    "seconds_total": 30 # total sample length in seconds; the rest gets padded with silence
}]
seed = 1000
n_steps = 50
cfg = 7.5
sampler = "dpmpp-3m-sde"

output = easy_generate(
    model,
    conditioning=conditioning,
    steps=n_steps, # number of diffusion steps to run
    cfg_scale=cfg, # classifier-free guidance scale
    sampler_type=sampler, # sampling "algorithm", check out https://github.com/Stability-AI/stable-audio-tools/blob/main/stable_audio_tools/inference/sampling.py#L177 for more options
    seed=seed,
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and trim to the requested duration
output = (
    output.to(torch.float32)
    .div(torch.max(torch.abs(output)))
    .clamp(-1, 1)
    .mul(32767)
    .to(torch.int16)
    .cpu()[:, :round(conditioning[0]['seconds_total'] * sample_rate)]
)

Now we can listen to the generated output! Note: if you are running in a Colab notebook, rendering the audio will stop the autosave feature, so make sure to delete the cell output if you want to turn autosave back on.

ipd.display(ipd.Audio(output, rate=sample_rate))