Code Tutorial#
import torch
import IPython.display as ipd

sr = 44100    # sample rate in Hz
duration = 5  # clip length in seconds
audio_sample = torch.randn(1, sr * duration)  # white noise, shape (channels, samples)
ipd.Audio(audio_sample.numpy(), rate=sr)
Stable Audio Open Tutorial#
Stable Audio Open is fully available through HuggingFace. To run Stable Audio Open locally, you first need to generate an $HF_TOKEN for yourself; see https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication for the exact steps (you will need to register a HuggingFace account first). Once the token is generated, export it as an environment variable with the following bash command:
export HF_TOKEN="YOUR_HF_TOKEN"
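As a quick sanity check that the export took effect in your current shell, you can read the variable back from Python (a minimal sketch; the variable name matches the export above, and the token itself is masked rather than printed):

```python
import os

# Read the token exported in the shell; None means the export did not take effect
token = os.environ.get("HF_TOKEN")
if token is None:
    print("HF_TOKEN is not set -- re-run the export command in the same shell")
else:
    # Avoid printing the full secret; show only a masked prefix
    print(f"HF_TOKEN is set ({token[:4]}...)")
```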
The rest of this tutorial largely follows the demo design of the public Stable Audio Open resources:
First, we need to install the following packages if you have not already. Installing Stable-Audio-Tools directly can run into some issues, so we recommend creating a dedicated virtual environment (not conda) for running this notebook.
# !pip install torch torchaudio torchvision stable-audio-tools einops
If you are running locally, you can set HF_TOKEN directly in your local environment (as shown below). If you are using a Colab notebook, you instead need to upload HF_TOKEN to Colab as a "secret" first, in which case the command below will have no effect.
import os
import warnings
os.environ['HF_TOKEN'] = 'Your API key'
warnings.filterwarnings('ignore', category=FutureWarning)
Next, we can load the model from HuggingFace. Note that stable-audio-tools has some known dependency issues on M1 Macs, so we recommend using a Colab notebook (or a Linux system) to run this tutorial.
import torch
import torchaudio
# import librosa
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import IPython.display as ipd
from functools import partial
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
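The `model_config` values determine the maximum clip length the model can generate. The numbers below are the published defaults for stable-audio-open-1.0 (an assumption; check your own `model_config` to confirm), and the arithmetic is a quick sketch:

```python
# Published defaults for stable-audio-open-1.0 (assumed; read these from your
# own model_config in practice)
sample_rate = 44100    # audio sample rate in Hz
sample_size = 2097152  # number of samples the model generates per clip

max_seconds = sample_size / sample_rate
print(f"Maximum clip length: {max_seconds:.2f} s")  # roughly 47.55 seconds
```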
First, we wrap the sampling code in a cleaner wrapper function, since some arguments must be provided but rarely need tuning.
# this just cleans things up a bit so the code below highlights the important knobs
easy_generate = partial(generate_diffusion_cond, sample_size=sample_size, sigma_min=0.3, sigma_max=500, device=device)
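`functools.partial` simply pre-binds keyword arguments, so `easy_generate(model, ...)` is equivalent to calling `generate_diffusion_cond` with `sample_size`, `sigma_min`, `sigma_max`, and `device` already filled in. A self-contained sketch of the same pattern, using a toy stand-in function rather than the real sampler:

```python
from functools import partial

def generate(model, steps, sample_size, sigma_min, sigma_max, device):
    # Toy stand-in for generate_diffusion_cond: just report what it received
    return {"model": model, "steps": steps, "sample_size": sample_size,
            "sigma_min": sigma_min, "sigma_max": sigma_max, "device": device}

# Pre-bind the arguments we don't want to tune, exactly as easy_generate does
easy = partial(generate, sample_size=2097152, sigma_min=0.3, sigma_max=500, device="cpu")

out = easy("toy-model", steps=50)  # only the "important knobs" remain
print(out["sigma_max"])  # → 500
```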
Next, we can define the conditioning information. For the default Stable Audio Open, this consists of text, a timing position, and a total duration.
# Set up text and timing conditioning
conditioning = [{
"prompt": "clean guitar, sweep picking, 140 bpm, G minor",
"seconds_start": 0, # "where" in time the sample starts within the song
"seconds_total": 30 # total sample length in seconds; the rest gets padded with silence
}]
seed = 1000
n_steps = 50
cfg = 7.5
sampler = "dpmpp-3m-sde"
output = easy_generate(
model,
conditioning=conditioning,
steps=n_steps, # number of diffusion steps to run
cfg_scale=cfg, # classifier-free guidance scale
sampler_type=sampler, # sampling "algorithm", check out https://github.com/Stability-AI/stable-audio-tools/blob/main/stable_audio_tools/inference/sampling.py#L177 for more options
seed=seed,
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and trim to the requested length
output = (
    output.to(torch.float32)
    .div(torch.max(torch.abs(output)))
    .clamp(-1, 1)
    .mul(32767)
    .to(torch.int16)
    .cpu()[:, :round(conditioning[0]['seconds_total'] * sample_rate)]
)
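The chained calls above perform peak normalization followed by 16-bit quantization. The same arithmetic in plain Python on a toy signal (a sketch for clarity; no torch required):

```python
# Peak-normalize a toy signal and quantize to int16, mirroring the torch chain above
samples = [0.5, -2.0, 1.0, 0.25]

peak = max(abs(s) for s in samples)                      # torch.max(torch.abs(output))
normalized = [s / peak for s in samples]                 # .div(peak) -> values in [-1, 1]
clipped = [min(1.0, max(-1.0, s)) for s in normalized]   # .clamp(-1, 1)
int16 = [int(s * 32767) for s in clipped]                # .mul(32767).to(torch.int16)

print(int16)  # → [8191, -32767, 16383, 4095]
```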
Now we can listen to the generated output! Note: if you are running on a Colab notebook, rendering the audio will stop the autosave feature, so make sure to delete the cell output if you want autosave re-enabled.
ipd.display(ipd.Audio(output, rate=sample_rate))
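Besides playing the clip inline, you may want to write it to disk. Since the tensor is already int16, the stdlib `wave` module suffices (a minimal sketch; `torchaudio.save` works too). It assumes `output` has shape (channels, samples) as produced above, and the helper name `save_int16_wav` is hypothetical:

```python
import wave

def save_int16_wav(path, data, sample_rate):
    """Write (channels, samples) int16 data as a WAV file. Hypothetical helper."""
    channels = len(data)
    n_samples = len(data[0])
    with wave.open(path, "wb") as f:
        f.setnchannels(channels)
        f.setsampwidth(2)  # int16 = 2 bytes per sample
        f.setframerate(sample_rate)
        # Interleave channels frame by frame, little-endian signed 16-bit
        frames = bytearray()
        for i in range(n_samples):
            for ch in range(channels):
                frames += int(data[ch][i]).to_bytes(2, "little", signed=True)
        f.writeframes(bytes(frames))

# Example: one second of stereo silence
save_int16_wav("output.wav", [[0] * 44100, [0] * 44100], 44100)
```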