speechprojects
diff --git a/‎.gitignore‎
Lines changed: 10 additions & 0 deletions b/‎.gitignore‎
diff --git a/‎LICENSE‎
Lines changed: 407 additions & 0 deletions b/‎LICENSE‎
diff --git a/‎README.md‎
Lines changed: 136 additions & 0 deletions b/‎README.md‎
diff --git a/‎data.py‎
Lines changed: 190 additions & 0 deletions b/‎data.py‎
@@ -0,0 +1,10 @@
+*.pyc
+*.pth
+*.tar.gz
+*.egg-info
+models/**
+data/**
+tools/**
+checkpoints/**
+notebooks/.ipynb_checkpoints/**
+.ipynb_checkpoints/**
@@ -0,0 +1,136 @@
+# VoiceLoop
+PyTorch implementation of the method described in the paper [Voice Synthesis for in-the-Wild Speakers via a Phonological Loop](https://arxiv.org/abs/1707.06588).
+
+<p align="center"><img width="70%" src="img/method.png" /></p>
+
+VoiceLoop is a neural text-to-speech (TTS) system that can transform text to speech in voices that are sampled
+in the wild. Demo samples can be [found here](https://ytaigman.github.io/loop/).
+
+## Quick Links
+- [Demo Samples](https://ytaigman.github.io/loop/)
+- [Quick Start](#quick-start)
+- [Setup](#setup)
+- [Training](#training)
+
+## Quick Start
+Follow the instructions in [Setup](#setup) and then simply execute:
+ ```bash
+ python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth
+ ```
+ Results are placed in ```models/vctk/results```. Two samples are generated:
+  * The [generated sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_10.wav) is saved with the gen_10.wav extension.
+  * The corresponding [ground-truth (test) sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.orig.wav) is saved with the orig.wav extension.
+  
+You can also generate the same text but with a different speaker, specifically:
+ ```bash
+ python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth
+ ```
+This generates the following [sample](https://ytaigman.github.io/loop/demos/vctk_tutorial/p318_212.gen_14.wav).
+
+Here are the corresponding attention plots:
+
+<p align="center"><img width="50%" src="img/attn_10.png" /><img width="50%" src="img/attn_14.png" /></p>
+
+Legend: the X-axis is output time (acoustic samples); the Y-axis is the input (text/phonemes). The left figure is speaker 10, the right is speaker 14.
+
+Finally, free text is also supported:
+ ```bash
+python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
+```
+
+## Setup
+Requirements: Linux/OSX, Python 2.7, and [PyTorch 0.1.12](http://pytorch.org/). The current version of the code requires CUDA support for training. Generation can be done on the CPU.
+
+```bash
+git clone https://github.com/facebookresearch/loop.git
+cd loop
+pip install -r scripts/requirements.txt
+```
+
+### Data
+The data used to train the models in the paper can be downloaded via:
+```bash
+bash scripts/download_data.sh
+```
+
+The script downloads and preprocesses a subset of [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html). This subset contains speakers with American accents.
+
+The dataset was preprocessed using [Merlin](http://www.cstr.ed.ac.uk/projects/merlin/): from each audio clip we extracted vocoder features using the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder. After downloading, the dataset will be located under the ```data``` subfolder as follows:
+
+```
+loop
+├── data
+    └── vctk
+        ├── norm_info
+        │   ├── norm.dat
+        ├── numpy_features
+        │   ├── p294_001.npz
+        │   ├── p294_002.npz
+        │   └── ...
+        └── numpy_features_valid
+```
+
+The preprocessing pipeline can be run using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.
+
+### Pretrained Models
+Pretrained models can be downloaded via:
+```bash
+bash scripts/download_models.sh
+```
+After downloading, the models will be located under the ```models``` subfolder as follows:
+
+```
+loop
+├── data
+├── models
+    ├── vctk
+    │   ├── args.pth
+    │   └── bestmodel.pth
+    └── vctk_alt
+```
+
+
+### SPTK and WORLD
+Finally, speech generation requires [SPTK 3.9](http://sp-tk.sourceforge.net/) and the [WORLD](http://ml.cs.yamanashi.ac.jp/world/english/) vocoder, as in Merlin. To download the executables:
+```bash
+bash scripts/download_tools.sh
+```
+This results in the following subdirectories:
+```
+loop
+├── data
+├── models
+├── tools
+    ├── SPTK-3.9
+    └── WORLD
+```
+ 
+## Training
+To train a new model on VCTK, first train it with a noise level of 4 and an input sequence length of 100:
+```bash
+python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90
+```
+Then continue training the model with a noise level of 2 on full sequences:
+```bash
+python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
+```
+
+## Citation
+If you find this code useful in your research then please cite:
+
+```
+@article{taigman2017voice,
+  title           = {Voice Synthesis for in-the-Wild Speakers via a Phonological Loop},
+  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
+  journal         = {ArXiv e-prints},
+  archivePrefix   = "arXiv",
+  eprinttype      = {arxiv},
+  eprint          = {1707.06588},
+  primaryClass    = "cs.CL",
+  year            = {2017},
+  month           = {July},
+}
+```
+
+## License
+Loop has a CC-BY-NC license.
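
Note on the `--spkr` values used in the Quick Start: `NpzFolder` in `data.py` builds its speaker table by taking the part of each `.npz` filename before the first underscore, sorting the unique prefixes, and numbering them from zero. The sketch below reproduces that mapping with the standard library only. The helper name `build_speaker_table` and the file paths are made up for illustration, and whether `generate.py` interprets `--spkr` through this same table is an assumption, since `generate.py` is not part of this diff.

```python
import os


def build_speaker_table(paths):
    """Map each speaker prefix (text before the first '_') to an integer index."""
    prefixes = sorted({os.path.basename(p).split('_')[0] for p in paths})
    return {name: index for index, name in enumerate(prefixes)}


# Hypothetical file paths, chosen only to illustrate the mapping.
example_paths = [
    "data/vctk/numpy_features/p318_212.npz",
    "data/vctk/numpy_features/p294_001.npz",
    "data/vctk/numpy_features/p294_002.npz",
]
speaker_table = build_speaker_table(example_paths)
print(speaker_table)
```

Because the indices come from the sorted prefixes, they depend on which speakers are present in the folder, so an index is only meaningful relative to the dataset the table was built from.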
@@ -0,0 +1,190 @@
+# Copyright 2017-present, Facebook, Inc.
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+from functools import partial
+import numpy as np
+import os
+
+import torch
+import torch.utils.data as data
+
+
+# Taken from
+# https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Dataset.py
+def batchify(data):
+    out, lengths = None, None
+
+    lengths = [x.size(0) for x in data]
+    max_length = max(lengths)
+
+    if data[0].dim() == 1:
+        out = data[0].new(len(data), max_length).fill_(0)
+        for i in range(len(data)):
+            data_length = data[i].size(0)
+            out[i].narrow(0, 0, data_length).copy_(data[i])
+    else:
+        feat_size = data[0].size(1)
+        out = data[0].new(len(data), max_length, feat_size).fill_(0)
+        for i in range(len(data)):
+            data_length = data[i].size(0)
+            out[i].narrow(0, 0, data_length).copy_(data[i])
+
+    return out, lengths
+
+
+def collate_by_input_length(batch, max_seq_len):
+    "Puts each data field into a tensor with outer dimension batch size"
+    if torch.is_tensor(batch[0]):
+        return batchify(batch)
+    elif isinstance(batch[0], int):
+        return torch.LongTensor(batch)
+    else:
+        new_batch = [x for x in batch if x[1].size(0) < max_seq_len]
+        if len(new_batch) == 0:
+            return (None, None), (None, None), None
+
+        batch = new_batch
+        transposed = zip(*batch)
+        (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers = \
+            [collate_by_input_length(samples, max_seq_len)
+                for samples in transposed]
+
+        # within batch sorting by decreasing length for variable length rnns
+        batch = zip(srcBatch, tgtBatch, tgtLengths, speakers)
+        batch, srcLengths = zip(*sorted(zip(batch, srcLengths),
+                                        key=lambda x: -x[1]))
+        srcBatch, tgtBatch, tgtLengths, speakers = zip(*batch)
+
+        srcBatch = torch.stack(srcBatch, 0).transpose(0, 1).contiguous()
+        tgtBatch = torch.stack(tgtBatch, 0).transpose(0, 1).contiguous()
+        srcLengths = torch.LongTensor(srcLengths)
+        tgtLengths = torch.LongTensor(tgtLengths)
+        speakers = torch.LongTensor(speakers).view(-1, 1)
+
+        return (srcBatch, srcLengths), (tgtBatch, tgtLengths), speakers
+
+    raise TypeError("batch must contain tensors, numbers, dicts or "
+                    "lists; found {}".format(type(batch[0])))
+
+
+class NpzFolder(data.Dataset):
+    NPZ_EXTENSION = 'npz'
+
+    def __init__(self, root):
+        self.root = root
+        self.npzs = self.make_dataset(self.root)
+
+        if len(self.npzs) == 0:
+            raise(RuntimeError("Found 0 npz in subfolders of: " + root + "\n"
+                               "Supported extensions are: " +
+                               self.NPZ_EXTENSION))
+
+        self.speakers = []
+        for fname in self.npzs:
+            self.speakers += [os.path.basename(fname).split('_')[0]]
+        self.speakers = list(set(self.speakers))
+        self.speakers.sort()
+        self.speakers = {v: i for i, v in enumerate(self.speakers)}
+
+        code2phone = np.load(self.npzs[0])['code2phone']
+        self.dict = {v: k for k, v in enumerate(code2phone)}
+
+    def __getitem__(self, index):
+        path = self.npzs[index]
+        txt, feat, spkr = self.loader(path)
+
+        return txt, feat, self.speakers[spkr]
+
+    def __len__(self):
+        return len(self.npzs)
+
+    def make_dataset(self, dir):
+        images = []
+
+        for root, _, fnames in sorted(os.walk(dir)):
+            for fname in fnames:
+                if fname.endswith('.' + self.NPZ_EXTENSION):
+                    path = os.path.join(root, fname)
+                    images.append(path)
+
+        return images
+
+    def loader(self, path):
+        feat = np.load(path)
+
+        txt = feat['phonemes'].astype('int64')
+        txt = torch.from_numpy(txt)
+
+        audio = feat['audio_features']
+        audio = torch.from_numpy(audio)
+
+        spkr = os.path.basename(path).split('_')[0]
+
+        return txt, audio, spkr
+
+
+class NpzLoader(data.DataLoader):
+    def __init__(self, *args, **kwargs):
+        kwargs['collate_fn'] = partial(collate_by_input_length,
+                                       max_seq_len=kwargs['max_seq_len'])
+        del kwargs['max_seq_len']
+
+        data.DataLoader.__init__(self, *args, **kwargs)
+
+
+class TBPTTIter(object):
+    """
+    Iterator for truncated backpropagation through time (TBPTT) training.
+    Target sequence is segmented while input sequence remains the same.
+    """
+    def __init__(self, src, trgt, spkr, seq_len):
+        self.seq_len = seq_len
+        self.start = True
+
+        self.speakers = spkr
+        self.srcBatch = src[0]
+        self.srcLenths = src[1]
+
+        # split batch
+        self.tgtBatch = list(torch.split(trgt[0], self.seq_len, 0))
+        self.tgtBatch.reverse()
+        self.len = len(self.tgtBatch)
+
+        # split length list
+        batch_seq_len = len(self.tgtBatch)
+        self.tgtLenths = [self.split_length(l, batch_seq_len) for l in trgt[1]]
+        self.tgtLenths = torch.stack(self.tgtLenths)
+        self.tgtLenths = list(torch.split(self.tgtLenths, 1, 1))
+        self.tgtLenths = [x.squeeze() for x in self.tgtLenths]
+        self.tgtLenths.reverse()
+
+        assert len(self.tgtLenths) == len(self.tgtBatch)
+
+    def split_length(self, seq_size, batch_seq_len):
+        seq = [self.seq_len] * (seq_size // self.seq_len)
+        if seq_size % self.seq_len != 0:
+            seq += [seq_size % self.seq_len]
+        seq += [0] * (batch_seq_len - len(seq))
+        return torch.LongTensor(seq)
+
+    def __next__(self):
+        if len(self.tgtBatch) == 0:
+            raise StopIteration()
+
+        if self.len > len(self.tgtBatch):
+            self.start = False
+
+        return (self.srcBatch, self.srcLenths), \
+               (self.tgtBatch.pop(), self.tgtLenths.pop()), \
+               self.speakers, self.start
+
+    next = __next__
+
+    def __iter__(self):
+        return self
+
+    def __len__(self):
+        return self.len
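
Review note on `TBPTTIter.split_length`: the method turns one target length into a list of per-chunk lengths for truncated backpropagation through time, padding with zeros up to the number of chunks in the batch. Below is a standalone sketch of that arithmetic using plain lists instead of `torch.LongTensor`. The function name `chunk_lengths` and the example numbers are illustrative and do not come from the repository.

```python
def chunk_lengths(seq_size, seq_len, num_chunks):
    """Split seq_size into chunks of at most seq_len, zero-padded to num_chunks."""
    lengths = [seq_len] * (seq_size // seq_len)   # full chunks
    remainder = seq_size % seq_len
    if remainder != 0:
        lengths.append(remainder)                 # final partial chunk
    lengths += [0] * (num_chunks - len(lengths))  # chunks after this sequence has ended
    return lengths


# A batch padded to 300 frames and cut into chunks of 100 has 3 chunks.
print(chunk_lengths(250, 100, 3))
print(chunk_lengths(100, 100, 3))
```

A zero entry marks a chunk in which that sequence contributes no frames, which is how shorter sequences in a batch can be ignored in later chunks.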