Commit 4aa5ae2

initial commit
0 parents  commit 4aa5ae2

16 files changed

Lines changed: 1909 additions & 0 deletions

.gitignore

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
*.pyc
*.pth
*.tar.gz
*.egg-info
models/**
data/**
tools/**
checkpoints/**
notebooks/.ipynb_checkpoints/**
.ipynb_checkpoints/**

LICENSE

Lines changed: 407 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# VoiceLoop
PyTorch implementation of the method described in [Voice Synthesis for in-the-Wild Speakers via a Phonological Loop](https://arxiv.org/abs/1707.06588).

<p align="center"><img width="70%" src="img/method.png" /></p>

VoiceLoop is a neural text-to-speech (TTS) system that can transform text to speech in voices that are sampled
in the wild. Some demo samples can be [found here](https://ytaigman.github.io/loop/).

## Quick Links
- [Demo Samples](https://ytaigman.github.io/loop/)
- [Quick Start](#quick-start)
- [Setup](#setup)
- [Training](#training)

## Quick Start
Follow the instructions in [Setup](#setup) and then simply execute:
```bash
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
```
Results will be placed in ```models/vctk/results```. The command generates 2 samples:
* The [generated sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_10.wav) is saved with the gen_10.wav extension.
* Its [ground-truth (test) sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.orig.wav) is saved with the orig.wav extension.

You can also generate the same text with a different speaker:
```bash
python generate.py --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth
```
This generates the following [sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_14.wav).

Here are the corresponding attention plots:

<p align="center"><img width="50%" src="img/attn_10.png" /><img width="50%" src="img/attn_14.png" /></p>

Legend: the X-axis is output time (acoustic samples) and the Y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14.

Finally, free text is also supported:
```bash
python generate.py --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
```

## Setup
Requirements: Linux/OSX, Python 2.7 and [PyTorch 0.1.12](http://pytorch.org/). The current version of the code requires CUDA support for training. Generation can be done on the CPU.

```bash
git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt
```

### Data
The data used to train the models in the paper can be downloaded via:
```bash
bash scripts/download_data.sh
```

The script downloads and preprocesses a subset of [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). This subset contains speakers with American accents.

The dataset was preprocessed using [Merlin](http://www.cstr.ed.ac.uk/projects/merlin/): from each audio clip we extracted vocoder features using the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder. After downloading, the dataset will be located under the subfolder ```data``` as follows:

```
loop
└── data
    └── vctk
        ├── norm_info
        │   └── norm.dat
        ├── numpy_features
        │   ├── p294_001.npz
        │   ├── p294_002.npz
        │   └── ...
        └── numpy_features_valid
```
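
In `data.py`, `NpzFolder` takes the part of each file name before the first underscore as the speaker label, sorts the distinct labels, and numbers them from zero. The plain-Python sketch below illustrates that mapping. The helper name `build_speaker_index` is hypothetical and the two paths are only illustrative, so the resulting indices are not those of the full dataset.

```python
import os


def build_speaker_index(paths):
    """Map each speaker prefix (text before the first '_') to an integer ID."""
    prefixes = {os.path.basename(p).split('_')[0] for p in paths}
    # Sorting makes the numbering independent of directory traversal order.
    return {name: idx for idx, name in enumerate(sorted(prefixes))}


# Illustrative paths only; a real run would walk the whole data folder.
paths = [
    "data/vctk/numpy_features/p294_001.npz",
    "data/vctk/numpy_features/p294_002.npz",
    "data/vctk/numpy_features_valid/p318_212.npz",
]
speaker_index = build_speaker_index(paths)
print(speaker_index)
```

Because the numbering depends on which speakers are present, a model trained on one set of files should presumably be used with the same set of speaker prefixes.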

The preprocessing pipeline can be run using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.

### Pretrained Models
Pretrained models can be downloaded via:
```bash
bash scripts/download_models.sh
```
After downloading, the models will be located under the subfolder ```models``` as follows:

```
loop
├── data
└── models
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt
```


### SPTK and WORLD
Finally, speech generation requires [SPTK3.9](http://sp-tk.sourceforge.net/) and the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder, as done in Merlin. To download the executables:
```bash
bash scripts/download_tools.sh
```
This results in the following subdirectories:
```
loop
├── data
├── models
└── tools
    ├── SPTK-3.9
    └── WORLD
```

## Training
To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:
```bash
python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
```
Then, continue training the model using a noise level of 2, on full sequences:
```bash
python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
```
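
`TBPTTIter` in `data.py` cuts each target sequence into fixed-size chunks and records how many valid frames every chunk holds. The sketch below mirrors its `split_length` logic in plain Python, whereas the real class operates on tensors. It assumes that `--seq-len` sets the chunk size, which `train.py` would need to confirm, and the example lengths are made up.

```python
def split_length(seq_size, seq_len, num_chunks):
    """Return the number of valid frames in each chunk of one target sequence."""
    chunks = [seq_len] * (seq_size // seq_len)  # full chunks
    if seq_size % seq_len != 0:
        chunks.append(seq_size % seq_len)       # trailing partial chunk
    # Sequences shorter than the longest one in the batch get empty chunks.
    chunks += [0] * (num_chunks - len(chunks))
    return chunks


lengths = [250, 100, 30]  # hypothetical target lengths in one batch
seq_len = 100
# Number of chunks needed for the longest sequence (ceiling division).
num_chunks = (max(lengths) + seq_len - 1) // seq_len
per_sequence = [split_length(n, seq_len, num_chunks) for n in lengths]
print(per_sequence)
```

With a chunk size at least as large as the longest target, every sequence fits in a single chunk, which would match the "full sequences" setting described for the second stage.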

## Citation
If you find this code useful in your research then please cite:

```
@article{taigman2017voice,
  title         = {Voice Synthesis for in-the-Wild Speakers via a Phonological Loop},
  author        = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprinttype    = {arxiv},
  eprint        = {1707.06588},
  primaryClass  = "cs.CL",
  year          = {2017},
  month         = jul,
}
```

## License
Loop has a CC-BY-NC license.

data.py

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

from functools import partial
import numpy as np
import os

import torch
import torch.utils.data as data


# Taken from
# https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Dataset.py
def batchify(data):
    out, lengths = None, None

    lengths = [x.size(0) for x in data]
    max_length = max(lengths)

    # zero-pad every sequence to the length of the longest one
    if data[0].dim() == 1:
        out = data[0].new(len(data), max_length).fill_(0)
        for i in range(len(data)):
            data_length = data[i].size(0)
            out[i].narrow(0, 0, data_length).copy_(data[i])
    else:
        feat_size = data[0].size(1)
        out = data[0].new(len(data), max_length, feat_size).fill_(0)
        for i in range(len(data)):
            data_length = data[i].size(0)
            out[i].narrow(0, 0, data_length).copy_(data[i])

    return out, lengths


def collate_by_input_length(batch, max_seq_len):
    "Puts each data field into a tensor with outer dimension batch size"
    if torch.is_tensor(batch[0]):
        return batchify(batch)
    elif isinstance(batch[0], int):
        return torch.LongTensor(batch)
    else:
        # drop samples whose target is too long
        new_batch = [x for x in batch if x[1].size(0) < max_seq_len]
        # check the filtered list: the unfiltered batch is never empty here
        if len(new_batch) == 0:
            return (None, None), (None, None), None

        batch = new_batch
        transposed = zip(*batch)
        (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers = \
            [collate_by_input_length(samples, max_seq_len)
             for samples in transposed]

        # within batch sorting by decreasing length for variable length rnns
        batch = zip(srcBatch, tgtBatch, tgtLengths, speakers)
        batch, srcLengths = zip(*sorted(zip(batch, srcLengths),
                                        key=lambda x: -x[1]))
        srcBatch, tgtBatch, tgtLengths, speakers = zip(*batch)

        srcBatch = torch.stack(srcBatch, 0).transpose(0, 1).contiguous()
        tgtBatch = torch.stack(tgtBatch, 0).transpose(0, 1).contiguous()
        srcLengths = torch.LongTensor(srcLengths)
        tgtLengths = torch.LongTensor(tgtLengths)
        speakers = torch.LongTensor(speakers).view(-1, 1)

        return (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers

    raise TypeError(("batch must contain tensors, numbers, dicts or "
                     "lists; found {}".format(type(batch[0]))))


class NpzFolder(data.Dataset):
    NPZ_EXTENSION = 'npz'

    def __init__(self, root):
        self.root = root
        self.npzs = self.make_dataset(self.root)

        if len(self.npzs) == 0:
            raise RuntimeError("Found 0 npz in subfolders of: " + root + "\n"
                               "Supported extension is: " +
                               self.NPZ_EXTENSION)

        # speaker label is the file-name prefix before the first underscore
        self.speakers = []
        for fname in self.npzs:
            self.speakers += [os.path.basename(fname).split('_')[0]]
        self.speakers = list(set(self.speakers))
        self.speakers.sort()
        self.speakers = {v: i for i, v in enumerate(self.speakers)}

        # phoneme -> integer code lookup, read from the first file
        code2phone = np.load(self.npzs[0])['code2phone']
        self.dict = {v: k for k, v in enumerate(code2phone)}

    def __getitem__(self, index):
        path = self.npzs[index]
        txt, feat, spkr = self.loader(path)

        return txt, feat, self.speakers[spkr]

    def __len__(self):
        return len(self.npzs)

    def make_dataset(self, dir):
        npzs = []

        for root, _, fnames in sorted(os.walk(dir)):
            for fname in fnames:
                # match on the extension rather than any substring
                if fname.endswith('.' + self.NPZ_EXTENSION):
                    path = os.path.join(root, fname)
                    npzs.append(path)

        return npzs

    def loader(self, path):
        feat = np.load(path)

        txt = feat['phonemes'].astype('int64')
        txt = torch.from_numpy(txt)

        audio = feat['audio_features']
        audio = torch.from_numpy(audio)

        spkr = os.path.basename(path).split('_')[0]

        return txt, audio, spkr


class NpzLoader(data.DataLoader):
    def __init__(self, *args, **kwargs):
        kwargs['collate_fn'] = partial(collate_by_input_length,
                                       max_seq_len=kwargs['max_seq_len'])
        del kwargs['max_seq_len']

        data.DataLoader.__init__(self, *args, **kwargs)


class TBPTTIter(object):
    """
    Iterator for truncated backpropagation through time (TBPTT) training.
    Target sequence is segmented while input sequence remains the same.
    """
    def __init__(self, src, trgt, spkr, seq_len):
        self.seq_len = seq_len
        self.start = True

        self.speakers = spkr
        self.srcBatch = src[0]
        self.srcLenths = src[1]

        # split batch
        self.tgtBatch = list(torch.split(trgt[0], self.seq_len, 0))
        self.tgtBatch.reverse()
        self.len = len(self.tgtBatch)

        # split length list
        batch_seq_len = len(self.tgtBatch)
        self.tgtLenths = [self.split_length(l, batch_seq_len) for l in trgt[1]]
        self.tgtLenths = torch.stack(self.tgtLenths)
        self.tgtLenths = list(torch.split(self.tgtLenths, 1, 1))
        self.tgtLenths = [x.squeeze() for x in self.tgtLenths]
        self.tgtLenths.reverse()

        assert len(self.tgtLenths) == len(self.tgtBatch)

    def split_length(self, seq_size, batch_seq_len):
        # floor division keeps the count an integer on Python 2 and 3
        seq = [self.seq_len] * (seq_size // self.seq_len)
        if seq_size % self.seq_len != 0:
            seq += [seq_size % self.seq_len]
        seq += [0] * (batch_seq_len - len(seq))
        return torch.LongTensor(seq)

    def __next__(self):
        if len(self.tgtBatch) == 0:
            raise StopIteration()

        # start is True only for the first chunk of the batch
        if self.len > len(self.tgtBatch):
            self.start = False

        return (self.srcBatch, self.srcLenths), \
               (self.tgtBatch.pop(), self.tgtLenths.pop()), \
               self.speakers, self.start

    next = __next__

    def __iter__(self):
        return self

    def __len__(self):
        return self.len
