publications

2026

CSL
Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, and Shoko Araki

Computer Speech & Language, Jan 2026

In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles a variety of recording conditions, from dinner parties to professional meetings and from two speakers to eight. We perform diarization first, followed by speech enhancement, and then ASR, as in the challenge baseline. However, we introduced several key refinements. First, we derived a powerful speaker diarization relying on end-to-end speaker diarization with vector clustering (EEND-VC), multi-channel speaker counting using enhanced embeddings from EEND-VC, and target-speaker voice activity detection (TS-VAD). For speech enhancement, we introduced a novel microphone selection rule to better select the most relevant microphones among the distributed microphones and investigated improvements to beamforming. Finally, for ASR, we developed several models exploiting Whisper and WavLM speech foundation models. In this paper, we present the original results we submitted to the challenge and updated results we obtained afterward. Our strongest system achieves a 63% relative macro tcpWER improvement over the baseline and outperforms the challenge best results on the NOTSOFAR-1 meeting evaluation data among geometry-independent systems.
@article{kamo2026microphone,
  title = {Microphone Array Geometry Independent Multi-Talker Distant {ASR}: {NTT} System for the {DASR} Task of the {CHiME}-8 Challenge},
  journal = {Computer Speech \& Language},
  author = {Kamo, Naoyuki and Tawara, Naohiro and Ando, Atsushi and Kano, Takatomo and Sato, Hiroshi and Ikeshita, Rintaro and Moriya, Takafumi and Horiguchi, Shota and Matsuura, Kohei and Ogawa, Atsunori and Plaquet, Alexis and Ashihara, Takanori and Ochiai, Tsubasa and Mimura, Masato and Delcroix, Marc and Nakatani, Tomohiro and Asami, Taichi and Araki, Shoko},
  volume = {95},
  pages = {101820},
  year = {2026},
  month = jan,
}

INTERSPEECH
Tight Boundary Prediction in Speaker Diarization Using Causal–Anticausal Consistency

Shota Horiguchi, Marc Delcroix, Naohiro Tawara, Takanori Ashihara, and Atsushi Ando

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2026

Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
@inproceedings{horiguchi2026tight,
  title = {Tight Boundary Prediction in Speaker Diarization Using Causal--Anticausal Consistency},
  author = {Horiguchi, Shota and Delcroix, Marc and Tawara, Naohiro and Ashihara, Takanori and Ando, Atsushi},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2026},
  month = sep,
}
ICASSPW
Target-Speaker Voice Activity Detection with Chunk-Level Speaker Queries

Naohiro Tawara and Shota Horiguchi

In IEEE International Conference on Acoustics, Speech and Signal Processing Workshop (ICASSPW), May 2026

Target speaker voice activity detection (TS-VAD) is a powerful approach for refining the outputs of diarization systems by re-estimating each speaker’s activity conditioned on that speaker’s embedding. Conventional TS-VAD methods typically rely on speaker embeddings extracted from the entire session, but this limits the model’s ability to capture intra-session acoustic variety and leads to suboptimal speaker diarization performance. To address this issue, we propose a novel TS-VAD framework that employs fine-grained speaker queries dynamically estimated for each audio chunk. To realize this framework, we explore several architectural variants that integrate self-supervised feature extractors, attention-based conditioning mechanisms, and a two-stage training strategy to jointly enforce session-level consistency while enabling fine-grained chunk-level specification. Evaluated on the DIHARD III corpus, our method improves the state-of-the-art result by the DiariZen initial diarization system.
@inproceedings{tawara2026chunk,
  title = {Target-Speaker Voice Activity Detection with Chunk-Level Speaker Queries},
  author = {Tawara, Naohiro and Horiguchi, Shota},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing Workshop (ICASSPW)},
  year = {2026},
  month = may,
  pages = {22202--22206},
}
ICASSP
Frontend Token Enhancement for Token-based Speech Recognition

Takanori Ashihara, Shota Horiguchi, Kohei Matsuura, Tsubasa Ochiai, and Marc Delcroix

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2026

Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
@inproceedings{ashihara2026frontend,
  title = {Frontend Token Enhancement for Token-based Speech Recognition},
  author = {Ashihara, Takanori and Horiguchi, Shota and Matsuura, Kohei and Ochiai, Tsubasa and Delcroix, Marc},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2026},
  month = may,
  pages = {17827--17831},
}

2025

ASRU
Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?

Shota Horiguchi, Naohiro Tawara, Takanori Ashihara, Atsushi Ando, and Marc Delcroix

In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2025

Neural speaker diarization is widely used for overlap-aware speaker diarization, but it requires large multi-speaker datasets for training. To meet this data requirement, large datasets are often constructed by combining multiple corpora, including those originally designed for multi-speaker automatic speech recognition (ASR). However, ASR datasets often feature loosely defined segment boundaries that do not align with the stricter conventions of diarization benchmarks. In this work, we show that such boundary looseness significantly impacts the diarization error rate, reducing evaluation reliability. We also reveal that models trained on data with varying boundary precision tend to learn dataset-specific looseness, leading to poor generalization across out-of-domain datasets. Training with standardized tight boundaries via forced alignment improves not only diarization performance, especially in streaming scenarios, but also ASR performance when combined with simple post-processing.
@inproceedings{horiguchi2025can,
  title = {Can We Really Repurpose Multi-Speaker {ASR} Corpus for Speaker Diarization?},
  author = {Horiguchi, Shota and Tawara, Naohiro and Ashihara, Takanori and Ando, Atsushi and Delcroix, Marc},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2025},
  month = dec,
}
INTERSPEECH
Mitigating Non-Target Speaker Bias in Guided Speaker Embedding

Shota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, and Naohiro Tawara

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap with small degradation in low-overlap cases. However, since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules, widely used in speaker embedding extractors, being overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploit the target speaker activity clues, to compute statistics from intervals where the target is active. The proposed method improves speaker verification performance in both low and high overlap ratios, and diarization performance on multiple datasets.
@inproceedings{horiguchi2025mitigating,
  title = {Mitigating Non-Target Speaker Bias in Guided Speaker Embedding},
  author = {Horiguchi, Shota and Ashihara, Takanori and Delcroix, Marc and Ando, Atsushi and Tawara, Naohiro},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2025},
  month = aug,
  pages = {5208--5212},
}
INTERSPEECH
Voice Impression Control in Zero-Shot TTS

Kenichi Fujita, Shota Horiguchi, and Yusuke Ijima

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

Para-/non-linguistic information in speech is pivotal in shaping the listeners’ impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method’s effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization.
@inproceedings{fujita2025voice,
  title = {Voice Impression Control in Zero-Shot {TTS}},
  author = {Fujita, Kenichi and Horiguchi, Shota and Ijima, Yusuke},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2025},
  month = aug,
  pages = {4363--4367},
}
INTERSPEECH
Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains

Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Kohei Matsuura, and Shota Horiguchi

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

Techniques for discrete audio representation, which convert an audio signal into a sequence of audio tokens using neural audio codecs or self-supervised speech models, have gained attention for offering the possibility of modeling audio with large language models (LM) efficiently. While these audio tokens have been studied in various domains (e.g., speech, music, and general sound), their encoding properties across domains remain unclear. This paper examines several audio token types to analyze cross-domain variations. Our major findings include that audio tokens exhibit consistent statistical structures and probabilistic predictability deduced from rank-frequency distribution and perplexity, regardless of the domain. However, the token usage pattern is somewhat domain-dependent. This result underpins the steady success of the versatile audio LM, while also suggesting that domain-aware LM could further optimize performance by better capturing domain-specific token usage.
@inproceedings{ashihara2025analysis,
  title = {Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains},
  author = {Ashihara, Takanori and Delcroix, Marc and Ochiai, Tsubasa and Matsuura, Kohei and Horiguchi, Shota},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2025},
  month = aug,
  pages = {226--230},
}
INTERSPEECH
Pretraining Multi-Speaker Identification for Neural Speaker Diarization

Shota Horiguchi, Atsushi Ando, Naohiro Tawara, and Marc Delcroix

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2025

End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers’ speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which cannot be fully obtained from real datasets alone. To address this issue, large-scale simulated data is often used for pretraining, but it requires enormous storage and I/O capacity, and simulating data that closely resembles real conversations remains challenging. In this paper, we propose pretraining a model to identify multiple speakers from an input fully overlapped mixture as an alternative to pretraining a diarization model. This method eliminates the need to prepare a large-scale simulated dataset while leveraging large-scale speaker recognition datasets for training. Through comprehensive experiments, we demonstrate that the proposed method enables a highly accurate yet lightweight local diarization model without simulated conversational data.
@inproceedings{horiguchi2025pretraining,
  title = {Pretraining Multi-Speaker Identification for Neural Speaker Diarization},
  author = {Horiguchi, Shota and Ando, Atsushi and Tawara, Naohiro and Delcroix, Marc},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2025},
  month = aug,
  pages = {1608--1612},
}
ICASSP
Guided Speaker Embedding

Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, and Marc Delcroix

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the latter purpose. Typical speaker embedding extraction approaches only use single-speaker intervals to avoid corrupting the embeddings with speech from interference speakers. However, this often makes speaker embeddings impossible to extract because sufficiently long non-overlapping intervals are not always available. In this paper, we propose using speaker activities as clues to extract the embedding of the speaker-of-interest directly from overlapping speech. Specifically, we concatenate the activity of target and non-target speakers to acoustic features before being fed to the model. We also condition the attention weights used for pooling so that the attention weights of the intervals in which the target speaker is inactive are zero. The effectiveness of the proposed method is demonstrated in speaker verification and speaker diarization.
@inproceedings{horiguchi2025guided,
  title = {Guided Speaker Embedding},
  author = {Horiguchi, Shota and Moriya, Takafumi and Ando, Atsushi and Ashihara, Takanori and Sato, Hiroshi and Tawara, Naohiro and Delcroix, Marc},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025},
  month = apr,
}
ICASSP
Multi-channel Speaker Counting for EEND-VC-based Speaker Diarization on Multi-domain Conversation

Naohiro Tawara, Atsushi Ando, Shota Horiguchi, and Marc Delcroix

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

This paper proposes a speaker counting scheme using multi-channel microphones for end-to-end neural diarization with a vector clustering (EEND-VC) speaker diarization pipeline. The EEND-VC-based system estimates the number of speakers by clustering speaker embeddings from small chunks. However, conventional speaker counting struggles in short sessions with limited available embeddings. We address this issue by leveraging the most possible embeddings from multi-channel signals to increase the number of embeddings. One challenge in using embeddings across channels is the biases caused by channel differences. To mitigate this issue, we extend the EEND-VC pipeline with two modifications: (1) applying speech enhancement before extracting speaker embedding to capture the speaker characteristics even from short chunks and (2) grouping microphones based on inter-channel correlation to perform speaker counting within each group and then aggregating these channel-wise results. The proposed scheme was integrated into our CHiME-8 diarization pipeline, achieving superior speaker counting accuracy compared to the CHiME-8 baseline, with 54.2% and 61.4% improvements in the development and evaluation sets, respectively.
@inproceedings{tawara2025multi,
  title = {Multi-channel Speaker Counting for {EEND-VC}-based Speaker Diarization on Multi-domain Conversation},
  author = {Tawara, Naohiro and Ando, Atsushi and Horiguchi, Shota and Delcroix, Marc},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025},
  month = apr,
}
ICASSP
Mamba-based Segmentation Model for Speaker Diarization

Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, and Shoko Araki

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba’s stronger processing capabilities allow usage of longer local windows, which significantly improve diarization quality by making the speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both traditional RNN and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.
@inproceedings{plaquet2025mamba,
  title = {Mamba-based Segmentation Model for Speaker Diarization},
  author = {Plaquet, Alexis and Tawara, Naohiro and Delcroix, Marc and Horiguchi, Shota and Ando, Atsushi and Araki, Shoko},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025},
  month = apr,
}
ICASSP
Alignment-Free Training for Transducer-Based Multi-Talker ASR

Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, and Masato Mimura

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2025

Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers’ transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker’s appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers’ speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
@inproceedings{moriya2025alignment,
  title = {Alignment-Free Training for Transducer-Based Multi-Talker {ASR}},
  author = {Moriya, Takafumi and Horiguchi, Shota and Delcroix, Marc and Masumura, Ryo and Ashihara, Takanori and Sato, Hiroshi and Matsuura, Kohei and Mimura, Masato},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025},
  month = apr,
}

Preprint
Dissecting the Segmentation Model of End-to-End Diarization with Vector Clustering

Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki, and Hervé Bredin

arXiv:2506.11605, Jun 2025

End-to-End Neural Diarization with Vector Clustering is a powerful and practical approach to perform Speaker Diarization. Multiple enhancements have been proposed for the segmentation model of these pipelines, but their synergy had not been thoroughly evaluated. In this work, we provide an in-depth analysis on the impact of major architecture choices on the performance of the pipeline. We investigate different encoders (SincNet, pretrained and finetuned WavLM), different decoders (LSTM, Mamba, and Conformer), different losses (multilabel and multiclass powerset), and different chunk sizes. Through in-depth experiments covering nine datasets, we found that the finetuned WavLM-based encoder always results in the best systems by a wide margin. The LSTM decoder is outclassed by Mamba- and Conformer-based decoders, and while we found Mamba more robust to other architecture choices, it is slightly inferior to our best architecture, which uses a Conformer encoder. We found that multilabel and multiclass powerset losses do not have the same distribution of errors. We confirmed that the multiclass loss helps almost all models attain superior performance, except when finetuning WavLM, in which case, multilabel is the superior choice. We also evaluated the impact of the chunk size on all aforementioned architecture choices and found that newer architectures tend to better handle long chunk sizes, which can greatly improve pipeline performance. Our best system achieved state-of-the-art results on five widely used speaker diarization datasets.
@misc{plaquet2025dissecting,
  title = {Dissecting the Segmentation Model of End-to-End Diarization with Vector Clustering},
  author = {Plaquet, Alexis and Tawara, Naohiro and Delcroix, Marc and Horiguchi, Shota and Ando, Atsushi and Araki, Shoko and Bredin, Herv\'{e}},
  year = {2025},
  month = jun,
  howpublished = {arXiv:2506.11605},
}

2024

SLT
Investigation of Speaker Representation for Target-Speaker Speech Processing

Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, and Hiroshi Sato

In IEEE Spoken Language Technology Workshop (SLT), Dec 2024

Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker’s speech even when it is corrupted by interference speakers. While most studies have focused on the training schemes or system architectures for each specific task, the auxiliary network for embedding target speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper attempts to address a fundamental question: what is the preferred speaker embedding for TS tasks? To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from target speaker identity in the form of a one-hot vector. To further understand the property of ideal speaker embedding, we optimize it using a gradient-based approach to improve performance on the TS task. Our analysis unveils that 1) speaker verification performance is somewhat unrelated to TS task performances, 2) the one-hot vector outperforms enrollment-based ones, and 3) the optimal embedding depends on the input mixture.
@inproceedings{ashihara2024investigation,
  title = {Investigation of Speaker Representation for Target-Speaker Speech Processing},
  author = {Ashihara, Takanori and Moriya, Takafumi and Horiguchi, Shota and Peng, Junyi and Ochiai, Tsubasa and Delcroix, Marc and Matsuura, Kohei and Sato, Hiroshi},
  booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
  year = {2024},
  month = dec,
  pages = {433--440},
}
SLT
Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings

Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, and Marc Delcroix

In IEEE Spoken Language Technology Workshop (SLT), Dec 2024

🏆 Honorable Mention Award @IEEE SLT 2024

This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker’s speech without pre-registration in multi-speaker scenarios, most studies on speaker embedding extraction focus on extracting embeddings only from single-speaker recordings. Some methods have been proposed for extracting speaker embeddings directly from multi-speaker recordings, but they typically require preparing a model for each possible number of speakers or involve complicated training procedures. The proposed method computes the embeddings of multiple speakers by focusing on different parts of the frame-wise embeddings extracted from the input multi-speaker audio. This is achieved by recursively computing attention weights for pooling the frame-wise embeddings. Additionally, we propose using the calculated attention weights to estimate the number of speakers in the recording, which allows the same model to be applied to various numbers of speakers. Experimental evaluations demonstrate the effectiveness of the proposed method in speaker verification and diarization tasks.
@inproceedings{horiguchi2024recursive,
  title = {Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings},
  author = {Horiguchi, Shota and Ando, Atsushi and Moriya, Takafumi and Ashihara, Takanori and Sato, Hiroshi and Tawara, Naohiro and Delcroix, Marc},
  booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
  year = {2024},
  month = dec,
  pages = {1219--1226},
}
INTERSPEECH
Factor-Conditioned Speaking-Style Captioning

Atsushi Ando, Takafumi Moriya, Shota Horiguchi, and Ryo Masumura

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.
@inproceedings{ando2024factor,
  title = {Factor-Conditioned Speaking-Style Captioning},
  booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  author = {Ando, Atsushi and Moriya, Takafumi and Horiguchi, Shota and Masumura, Ryo},
  year = {2024},
  month = sep,
  pages = {782--786},
}
INTERSPEECH
SpeakerBeam-SS: Real-Time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, and Marc Delcroix

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024

Real-time target speaker extraction (TSE) is intended to extract the desired speaker’s voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been shown to model long-term dependency effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependency in Conv-TasNet, resulting in the reduction of model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% from the conventional causal Conv-TasNet-based TSE while matching its performance.
@inproceedings{sato2024speakerbeamss, title = {{SpeakerBeam-SS}: Real-Time Target Speaker Extraction with Lightweight {Conv-TasNet} and State Space Modeling}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Sato, Hiroshi and Moriya, Takafumi and Mimura, Masato and Horiguchi, Shota and Ochiai, Tsubasa and Ashihara, Takanori and Ando, Atsushi and Shinayama, Kentaro and Delcroix, Marc}, year = {2024}, month = sep, pages = {5033--5037}, }
ICASSP
Streaming Active Learning for Regression Problems Using Regression via Classification

Shota Horiguchi, Kota Dohi, and Yohei Kawaguchi

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024

Abs arXiv Bib HTML

One of the challenges in deploying a machine learning model is that the model’s performance degrades as the operating environment changes. To maintain the performance, streaming active learning is used, in which the model is retrained by adding a newly annotated sample to the training dataset if the prediction of the sample is not certain enough. Although many streaming active learning methods have been proposed for classification, few efforts have been made for regression problems, which are often handled in the industrial field. In this paper, we propose to use the regression-via-classification framework for streaming active learning for regression. Regression-via-classification transforms regression problems into classification problems so that streaming active learning methods proposed for classification problems can be applied directly to regression problems. Experimental validation on four real data sets shows that the proposed method can perform regression with higher accuracy at the same annotation cost.
@inproceedings{horiguchi2024streaming, title = {Streaming Active Learning for Regression Problems Using Regression via Classification}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author = {Horiguchi, Shota and Dohi, Kota and Kawaguchi, Yohei}, year = {2024}, month = apr, pages = {4955--4959}, }
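A minimal sketch of the regression-via-classification idea described in the abstract above, not the authors' implementation: continuous targets are discretized into bins, a classifier predicts a bin, and the prediction is mapped back to a value. The equal-width binning, the expected-bin-centre decoding, the least-confidence query rule with its 0.6 threshold, and all function names are illustrative assumptions.

```python
from bisect import bisect_right

def make_bins(y_min, y_max, n_bins):
    """Return equal-width bin edges and bin centres (binning scheme is an assumption)."""
    width = (y_max - y_min) / n_bins
    edges = [y_min + i * width for i in range(n_bins + 1)]
    centres = [(edges[i] + edges[i + 1]) / 2 for i in range(n_bins)]
    return edges, centres

def to_class(y, edges):
    """Map a continuous target to the index of the bin that contains it."""
    idx = bisect_right(edges, y) - 1
    return min(max(idx, 0), len(edges) - 2)  # clamp so y == y_max lands in the last bin

def to_value(class_probs, centres):
    """Map class probabilities back to a regression value (expected bin centre)."""
    return sum(p * c for p, c in zip(class_probs, centres))

def should_query(class_probs, threshold=0.6):
    """Hypothetical least-confidence rule: ask for a label when the top probability is low."""
    return max(class_probs) < threshold

edges, centres = make_bins(0.0, 10.0, 5)          # bins of width 2 over [0, 10]
label = to_class(3.7, edges)                       # class index used to train a classifier
estimate = to_value([0.0, 0.8, 0.2, 0.0, 0.0], centres)  # decode a classifier output
```

Because the model output is a class distribution, any uncertainty criterion designed for classification (here, the top class probability) can decide whether a streaming sample should be annotated.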

CHiME
NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Naoyuki Kamo*, Naohiro Tawara*, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, and Shoko Araki

In The 8th International Workshop on Speech Processing in Everyday Environments (CHiME-2024), Sep 2024

(*) Equal contribution

Abs arXiv Bib HTML Poster Slides

We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.4 % on the dev set, which is a 57 % relative improvement over the baseline.
@inproceedings{kamo2024ntt, title = {{NTT} Multi-Speaker {ASR} System for the {DASR} Task of {CHiME-8} Challenge}, author = {Kamo, Naoyuki and Tawara, Naohiro and Ando, Atsushi and Kano, Takatomo and Sato, Hiroshi and Ikeshita, Rintaro and Moriya, Takafumi and Horiguchi, Shota and Matsuura, Kohei and Ogawa, Atsunori and Plaquet, Alexis and Ashihara, Takanori and Ochiai, Tsubasa and Mimura, Masato and Delcroix, Marc and Nakatani, Tomohiro and Asami, Taichi and Araki, Shoko}, booktitle = {The 8th International Workshop on Speech Processing in Everyday Environments (CHiME-2024)}, year = {2024}, month = sep, }

Preprint
Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits

Hiroyuki Namba, Shota Horiguchi, Masaki Hamamoto, and Masashi Egi

arXiv:2402.08209, Feb 2024

Abs arXiv Bib

Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset. Data Shapley is a common theoretically guaranteed method to evaluate the contribution of each instance to model performance; however, it requires training on all subsets of the training data, which is computationally expensive. In this paper, we propose an iterative method to quickly identify a subset of instances with low data Shapley values by using the thresholding bandit algorithm. We provide a theoretical guarantee that the proposed method can accurately select harmful instances if a sufficiently large number of iterations is conducted. Empirical evaluation using various models and datasets demonstrated that the proposed method efficiently improved the computational speed while maintaining the model performance.
@misc{namba2024thresholding, title = {Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits}, author = {Namba, Hiroyuki and Horiguchi, Shota and Hamamoto, Masaki and Egi, Masashi}, year = {2024}, month = feb, howpublished = {arXiv:2402.08209}, }

2023

TASLP
Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, and Yohei Kawaguchi

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Jan 2023

🏆 IEEE SPS Japan Young Author Best Paper Award

Abs arXiv Bib HTML

A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally the results of each block are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
@article{horiguchi2023online, title = {Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, author = {Horiguchi, Shota and Watanabe, Shinji and Garcia, Paola and Takashima, Yuki and Kawaguchi, Yohei}, volume = {31}, year = {2023}, month = jan, pages = {706--720}, }

APSIPA ASC
Synthetic Data Augmentation for ASR with Domain Filtering

Tuan Vu Ho, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Takashi Sumiyoshi

In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Nov 2023

Abs Bib HTML

Recent studies have shown that synthetic speech can effectively serve as training data for automatic speech recognition models. Text data for synthetic speech is mostly obtained from in-domain text or generated text using augmentation. However, obtaining large amounts of in-domain text data with diverse lexical contexts is difficult, especially in low-resource scenarios. This paper proposes using text from a large generic-domain source and applying a domain filtering method to choose the relevant text data. This method involves two filtering steps: 1) selecting text based on its semantic similarity to the available in-domain text and 2) diversifying the vocabulary of the selected text using a greedy-search algorithm. Experimental results show that our proposed method outperforms the conventional text augmentation approach, with the relative reduction of word-error-rate ranging from 6% to 25% on the LibriSpeech dataset and 15% on a low-resource Vietnamese dataset.
@inproceedings{ho2023synthetic, title = {Synthetic Data Augmentation for {ASR} with Domain Filtering}, booktitle = {Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)}, author = {Ho, Tuan Vu and Horiguchi, Shota and Watanabe, Shinji and Garcia, Paola and Sumiyoshi, Takashi}, year = {2023}, month = nov, pages = {1760--1765}, }
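The two filtering steps named in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's method: bag-of-words cosine similarity stands in for the semantic similarity measure, the greedy criterion is "most unseen words", and the function names, the 0.5 threshold, and the toy sentences are all hypothetical.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def filter_by_similarity(candidates, in_domain, min_sim):
    """Step 1: keep candidates similar enough to the pooled in-domain text."""
    domain_bow = Counter(w for s in in_domain for w in s.split())
    return [s for s in candidates if cosine(Counter(s.split()), domain_bow) >= min_sim]

def greedy_diversify(sentences, k):
    """Step 2: greedily pick up to k sentences, each adding the most unseen words."""
    chosen, seen, pool = [], set(), list(sentences)
    while pool and len(chosen) < k:
        best = max(pool, key=lambda s: len(set(s.split()) - seen))
        chosen.append(best)
        seen |= set(best.split())
        pool.remove(best)
    return chosen

# Toy data (made up for illustration).
in_domain = ["book a flight to tokyo"]
candidates = [
    "book a flight to osaka",
    "book a hotel in tokyo",
    "the cat sat on the mat",
    "book a flight to tokyo",
]
kept = filter_by_similarity(candidates, in_domain, min_sim=0.5)
selected = greedy_diversify(kept, k=2)
```

In this toy run, the off-domain sentence is dropped in step 1, and step 2 prefers the sentence that contributes new vocabulary over the one that duplicates the in-domain text.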
INTERSPEECH
Spoofing Attacker Also Benefits from Large-Scale Self-Supervised Models

Aoi Ito* and Shota Horiguchi*

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

(*) Equal contribution

Abs arXiv Bib HTML

Large-scale pretrained models using self-supervised learning have reportedly improved the performance of speech anti-spoofing. However, the attacker side may also make use of such models. Also, since it is very expensive to train such models from scratch, pretrained models on the Internet are often used, but the attacker and defender may possibly use the same pretrained model. This paper investigates whether the improvement in anti-spoofing with pretrained models holds under the condition that the models are available to attackers. As the attacker, we train a model that enhances spoofed utterances so that the speaker embedding extractor based on the pretrained models cannot distinguish between bona fide and spoofed utterances. Experimental results show that the gains the anti-spoofing models obtained by using the pretrained models almost disappear if the attacker also makes use of the pretrained models.
@inproceedings{ito2023spoofing, title = {Spoofing Attacker Also Benefits from Large-Scale Self-Supervised Models}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Ito, Aoi and Horiguchi, Shota}, year = {2023}, month = aug, pages = {5346--5350}, }
INTERSPEECH
CAPTDURE: Captioned Sound Dataset of Single Sources

Yuki Okamoto, Kanta Shimonishi, Keisuke Imoto, Kota Dohi, Shota Horiguchi, and Yohei Kawaguchi

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023

Abs arXiv Bib HTML Website

In conventional studies on environmental sound separation and synthesis using captions, sound datasets consisting of captions for multiple-source sounds were used for model training. However, when captions are collected for multiple-source sounds, detailed captions for each sound source cannot be collected. Therefore, it is difficult to extract only the single-source target sound by the model-training method using a conventional captioned sound dataset. We constructed a dataset with captions for a single-source sound that can be used in various tasks that involve environmental sounds, such as environmental sound synthesis. Our dataset consists of 1,044 audio samples and 4,902 captions. We also conducted environmental sound extraction experiments using our dataset and evaluated the performance. The experimental results indicate that the captions for a single-source sound are effective in extracting only the single-source target sound from the mixture sound.
@inproceedings{okamoto2023captdure, title = {{CAPTDURE}: Captioned Sound Dataset of Single Sources}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Okamoto, Yuki and Shimonishi, Kanta and Imoto, Keisuke and Dohi, Kota and Horiguchi, Shota and Kawaguchi, Yohei}, year = {2023}, month = aug, pages = {1683--1687}, }
SLT
Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

Shota Horiguchi, Yuki Takashima, Shinji Watanabe, and Paola García

In IEEE Spoken Language Technology Workshop (SLT), Jan 2023

Abs arXiv Bib HTML

Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. Conversely, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately. We first introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs. Using this model, we alternately conduct i) knowledge distillation from a multi-channel model to a single-channel model and ii) finetuning from the distilled single-channel model to a multi-channel model. Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances.
@inproceedings{horiguchi2023mutual, title = {Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization}, booktitle = {IEEE Spoken Language Technology Workshop (SLT)}, author = {Horiguchi, Shota and Takashima, Yuki and Watanabe, Shinji and Garc{\'i}a, Paola}, year = {2023}, month = jan, pages = {620--625}, }

2022

TASLP
Encoder-Decoder Based Attractors for End-to-End Neural Diarization

Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Paola García

IEEE/ACM Transactions on Audio, Speech, and Language Processing, Mar 2022

🏆 IEEE SPS Young Author Best Paper Award
🏆 Itakura Prize Innovative Young Researcher Award

Abs arXiv Bib HTML

This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against cascaded approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional cascaded approach.
@article{horiguchi2022encoderdecoder, title = {Encoder-Decoder Based Attractors for End-to-End Neural Diarization}, journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing}, author = {Horiguchi, Shota and Fujita, Yusuke and Watanabe, Shinji and Xue, Yawen and Garc{\'i}a, Paola}, volume = {30}, year = {2022}, month = mar, pages = {1493--1507}, }
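As a rough illustration of the abstract's statement that diarization results are estimated as dot products of attractors and embeddings, the sketch below scores every frame against every attractor. The sigmoid, the 0.5 threshold, the function names, and the toy 2-D vectors are assumptions for illustration; the attractor generation by the LSTM encoder-decoder is not modeled here.

```python
import math

def sigmoid(x):
    """Logistic function mapping a dot product to a (0, 1) activity score."""
    return 1.0 / (1.0 + math.exp(-x))

def speaker_activities(embeddings, attractors, threshold=0.5):
    """Per-frame, per-speaker activity from sigmoid(embedding . attractor)."""
    activities = []
    for e in embeddings:
        probs = [sigmoid(sum(ei * ai for ei, ai in zip(e, a))) for a in attractors]
        activities.append([p > threshold for p in probs])
    return activities

# Toy 2-D embeddings for three frames and two attractors (values are made up).
embeddings = [
    [2.0, -2.0],   # aligned with attractor 0 only
    [2.0, 2.0],    # overlap: aligned with both attractors
    [-2.0, -2.0],  # silence: aligned with neither attractor
]
attractors = [[1.0, 0.0], [0.0, 1.0]]
result = speaker_activities(embeddings, attractors)
```

Because each speaker is scored independently, a frame can be active for several speakers at once, which is how overlapping speech is represented.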

INTERSPEECH
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Yohei Kawaguchi

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2022

Abs arXiv Bib HTML

In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model. Conventional approaches require extra parameters of the same size as the model for optimization, and it is difficult to apply these approaches to end-to-end ASR models because they have a huge number of parameters. To solve this problem, we first investigate which parts of end-to-end ASR models contribute to high accuracy in the target domain while preventing catastrophic forgetting. We conduct experiments on incremental domain adaptation from the LibriSpeech dataset to the AMI meeting corpus with two popular end-to-end ASR models and find that adapting only the linear layers of their encoders can prevent catastrophic forgetting. Then, on the basis of this finding, we develop an element-wise parameter selection focused on specific layers to further reduce the number of fine-tuning parameters. Experimental results show that our approach consistently prevents catastrophic forgetting compared to parameter selection from the whole model.
@inproceedings{takashima2022updating, author = {Takashima, Yuki and Horiguchi, Shota and Watanabe, Shinji and Garcia, Paola and Kawaguchi, Yohei}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, title = {Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End {ASR} Models}, year = {2022}, month = sep, pages = {2218--2222}, }
ICML
Rethinking Fano’s Inequality in Ensemble Learning

Terufumi Morishita, Gaku Morio, Shota Horiguchi, Hiroaki Ozaki, and Nobuo Nukaga

In International Conference on Machine Learning (ICML), Jul 2022

Abs arXiv Bib HTML

We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano’s inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano’s inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
@inproceedings{morishita2022rethinking, author = {Morishita, Terufumi and Morio, Gaku and Horiguchi, Shota and Ozaki, Hiroaki and Nukaga, Nobuo}, booktitle = {International Conference on Machine Learning (ICML)}, title = {Rethinking {Fano}'s Inequality in Ensemble Learning}, year = {2022}, month = jul, pages = {15976--16016}, }
Odyssey
Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization

Natsuo Yamashita, Shota Horiguchi, and Takeshi Homma

In The Speaker and Language Recognition Workshop (Odyssey), Jun 2022

Abs arXiv Bib HTML

This paper investigates a method for simulating natural conversation in the model training of end-to-end neural diarization (EEND). Due to the lack of any annotated real conversational dataset, EEND is usually pretrained on a large-scale simulated conversational dataset first and then adapted to the target real dataset. Simulated datasets play an essential role in the training of EEND, but as yet there has been insufficient investigation into an optimal simulation method. We thus propose a method to simulate natural conversational speech. In contrast to conventional methods, which simply combine the speech of multiple speakers, our method takes turn-taking into account. We define four types of speaker transition and sequentially arrange them to simulate natural conversations. The dataset simulated using our method was found to be statistically similar to the real dataset in terms of the silence and overlap ratios. The experimental results on two-speaker diarization using the CALLHOME and CSJ datasets showed that the simulated dataset contributes to improving the performance of EEND.
@inproceedings{yamashita2022improving, author = {Yamashita, Natsuo and Horiguchi, Shota and Homma, Takeshi}, title = {Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization}, booktitle = {The Speaker and Language Recognition Workshop (Odyssey)}, year = {2022}, month = jun, pages = {133--140}, }
ICASSP
Multi-Channel End-to-End Neural Diarization with Distributed Microphones

Shota Horiguchi, Yuki Takashima, Paola García, Shinji Watanabe, and Yohei Kawaguchi

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022

Abs arXiv Bib HTML

Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and suitable for distributed microphone settings. We also propose a model adaptation method using only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is inoperative given multi-channel inputs, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker.
@inproceedings{horiguchi2022multichannel, title = {Multi-Channel End-to-End Neural Diarization with Distributed Microphones}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author = {Horiguchi, Shota and Takashima, Yuki and Garc{\'i}a, Paola and Watanabe, Shinji and Kawaguchi, Yohei}, year = {2022}, month = may, pages = {7332--7336}, }
ICASSP
Environmental Sound Extraction Using Onomatopoeic Words

Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto, Keisuke Imoto, and Yohei Kawaguchi

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022

🏆 IEEE SPS Japan Student Conference Paper Award

Abs arXiv Bib HTML Website

An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeic words to specify the target sound to be extracted. By this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoeic word using a U-Net architecture, then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeic word and performs better than conventional methods that use sound-event classes to specify the target sound.
@inproceedings{okamoto2022environmental, title = {Environmental Sound Extraction Using Onomatopoeic Words}, author = {Okamoto, Yuki and Horiguchi, Shota and Yamamoto, Masaaki and Imoto, Keisuke and Kawaguchi, Yohei}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2022}, month = may, pages = {221--225}, }

2021

ASRU
Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

Shota Horiguchi, Paola García, Shinji Watanabe, Yawen Xue, Yuki Takashima, and Yohei Kawaguchi

In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2021

Abs arXiv Bib HTML

Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved diarization error rates of 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods.
@inproceedings{horiguchi2021towards, title = {Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors}, booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, author = {Horiguchi, Shota and Garc{\'i}a, Paola and Watanabe, Shinji and Xue, Yawen and Takashima, Yuki and Kawaguchi, Yohei}, year = {2021}, month = dec, pages = {98--105}, }
INTERSPEECH
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2021

Abs arXiv Bib HTML

We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a new chunk. However, it only worked with two-speaker recordings. In this paper, we propose an extended STB for flexible numbers of speakers, FLEX-STB. The proposed method uses a zero-padding followed by speaker-tracing, which alleviates the difference in the number of speakers between a buffer and a current chunk. We also examine buffer update strategies to select important frames for tracing multiple speakers. Experiments on CALLHOME and DIHARD II datasets show that the proposed method achieves comparable performance to the offline EEND method with 1-second latency. The results also show that our proposed method outperforms recently proposed chunk-wise diarization methods based on EEND (BW-EDA-EEND).
@inproceedings{xue2021online2, title = {Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers}, author = {Xue, Yawen and Horiguchi, Shota and Fujita, Yusuke and Takashima, Yuki and Watanabe, Shinji and Garcia, Paola and Nagamatsu, Kenji}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, pages = {3116--3120}, year = {2021}, month = sep, }
INTERSPEECH
Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2021

Abs arXiv Bib HTML

In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each time frame in a recording. In this paper, we explore a pseudo-labeling approach that employs unlabeled data. First, we propose an iterative pseudo-label method for EEND, which trains the model using unlabeled data of a target condition. Then, we also propose a committee-based training method to improve the performance of EEND. To evaluate our proposed method, we conduct model adaptation experiments using labeled and unlabeled data. Experimental results on the CALLHOME dataset show that our proposed pseudo-label achieved a 37.4% relative diarization error rate reduction compared to a seed model. Moreover, we analyzed the results of semi-supervised adaptation with pseudo-labeling. We also show the effectiveness of our approach on the third DIHARD dataset.
@inproceedings{takashima2021semisupervised, title = {Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization}, author = {Takashima, Yuki and Fujita, Yusuke and Horiguchi, Shota and Watanabe, Shinji and Garcia, Paola and Nagamatsu, Kenji}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, pages = {3096--3110}, year = {2021}, month = sep, }
ICASSP
End-to-End Speaker Diarization as Post-Processing

Shota Horiguchi, Paola García, Yusuke Fujita, Shinji Watanabe, and Kenji Nagamatsu

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2021

Abs arXiv Bib HTML

This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can treat a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other’s weakness, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped region. Experimental results show that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
@inproceedings{horiguchi2021endtoend, title = {End-to-End Speaker Diarization as Post-Processing}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author = {Horiguchi, Shota and Garc{\'i}a, Paola and Fujita, Yusuke and Watanabe, Shinji and Nagamatsu, Kenji}, year = {2021}, month = may, pages = {7188--7192}, }
SLT
End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola Garcia, and Kenji Nagamatsu

In IEEE Spoken Language Technology Workshop (SLT), Jan 2021

Abs Bib HTML

In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate.
@inproceedings{takashima2021endtoend, title = {End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection}, author = {Takashima, Yuki and Fujita, Yusuke and Watanabe, Shinji and Horiguchi, Shota and Garcia, Paola and Nagamatsu, Kenji}, booktitle = {IEEE Spoken Language Technology Workshop (SLT)}, pages = {849--856}, year = {2021}, month = jan, }
SLT
Online End-to-End Neural Diarization with Speaker-Tracing Buffer

Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu

In IEEE Spoken Language Technology Workshop (SLT), Jan 2021

Abs arXiv Bib HTML

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker permutation problem because speaker regions may be assigned inconsistently across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames in the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we trained SA-EEND with variable chunk-sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. Experimental results, including online SA-EEND and variable chunk-size, achieved DERs of 12.54% for CALLHOME and 20.77% for CSJ with 1.4 s actual latency.
@inproceedings{xue2021online, title = {Online End-to-End Neural Diarization with Speaker-Tracing Buffer}, author = {Xue, Yawen and Horiguchi, Shota and Fujita, Yusuke and Watanabe, Shinji and Garcia, Paola and Nagamatsu, Kenji}, booktitle = {IEEE Spoken Language Technology Workshop (SLT)}, pages = {841--848}, year = {2021}, month = jan, }
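The correlation check mentioned in the abstract above can be illustrated with a small sketch. It assumes that the outputs previously stored for the buffered frames and the outputs recomputed for those same frames are available as per-speaker lists; the function names `pearson`, `resolve_permutation`, and `reorder` are hypothetical and not taken from the paper's code, and Pearson correlation is one possible choice of correlation measure.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def resolve_permutation(stored, recomputed):
    """Find the speaker ordering of `recomputed` that best matches `stored`.

    stored, recomputed: [num_speakers][num_buffer_frames] activity posteriors
    for the same buffered frames, from the previous and the current pass.
    Returns perm such that recomputed[perm[s]] corresponds to stored[s].
    """
    num_speakers = len(stored)
    return max(
        itertools.permutations(range(num_speakers)),
        key=lambda perm: sum(
            pearson(stored[s], recomputed[perm[s]]) for s in range(num_speakers)
        ),
    )


def reorder(outputs, perm):
    """Apply the resolved permutation to the outputs of the current chunk."""
    return [outputs[p] for p in perm]
```

With the permutation resolved on the buffered frames, `reorder` can be applied to the current chunk's outputs so that speaker indices stay consistent with earlier chunks.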
SLT
Block-Online Guided Source Separation

Shota Horiguchi, Yusuke Fujita, and Kenji Nagamatsu

In IEEE Spoken Language Technology Workshop (SLT), Jan 2021

Abs arXiv Bib HTML

We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update parameters of the generative model of observation signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of calculation time, which is an obstacle to the deployment of online applications. It is also a problem that the offline GSS is an utterance-wise algorithm so that it produces latency according to the length of the utterance. With the proposed algorithm, block-wise input samples and corresponding time annotations are concatenated with those in the preceding context and used to update the parameters. Using the context enables the algorithm to estimate time-frequency masks accurately only from one iteration of optimization for each block, and its latency does not depend on the utterance length but predetermined block length. It also reduces calculation cost by updating only the parameters of active speakers in each block and its context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm but with 32x faster calculation, which is sufficient for real-time applications.
@inproceedings{horiguchi2021blockonline, title = {Block-Online Guided Source Separation}, booktitle = {IEEE Spoken Language Technology Workshop (SLT)}, author = {Horiguchi, Shota and Fujita, Yusuke and Nagamatsu, Kenji}, year = {2021}, month = jan, pages = {236--242}, }

DIHARD
The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-vector Clustering Systems Combined by DOVER-Lap

Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, and Sanjeev Khudanpur

In The Third DIHARD Speech Diarization Challenge (DIHARD III), Jan 2021

Abs arXiv Bib PDF Slides

This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After the DOVER-Lap based system combination, it achieved diarization error rates of 11.58 % and 14.09 % in Track 1 full and core, and 16.94 % and 20.01 % in Track 2 full and core, respectively. With their results, we won second place in all the tasks of the challenge.
@inproceedings{horiguchi2022hitachijhu, title = {The {Hitachi-JHU} {DIHARD} {III} System: Competitive End-to-End Neural Diarization and X-vector Clustering Systems Combined by {DOVER-Lap}}, author = {Horiguchi, Shota and Yalta, Nelson and Garcia, Paola and Takashima, Yuki and Xue, Yawen and Raj, Desh and Huang, Zili and Fujita, Yusuke and Watanabe, Shinji and Khudanpur, Sanjeev}, booktitle = {The Third DIHARD Speech Diarization Challenge (DIHARD III)}, year = {2021}, month = jan, }

2020

TPAMI
Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features

Shota Horiguchi, Daiki Ikami, and Kiyoharu Aizawa

IEEE Transactions on Pattern Analysis and Machine Intelligence, May 2020

Abs arXiv Bib HTML

End-to-end distance metric learning (DML) has been applied to obtain features useful in many computer vision tasks. However, these DML studies have not provided equitable comparisons between features extracted from DML-based networks and softmax-based networks. In this paper, we present objective comparisons between these two approaches under the same network architecture.
@article{horiguchi2020significance, title = {Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, author = {Horiguchi, Shota and Ikami, Daiki and Aizawa, Kiyoharu}, volume = {42}, number = {5}, year = {2020}, month = may, pages = {1279--1285}, }

INTERSPEECH
Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones

Shota Horiguchi, Yusuke Fujita, and Kenji Nagamatsu

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Oct 2020

Abs Bib HTML

A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without considering sampling frequency mismatch between microphones. Evaluation on our real meeting datasets showed that our framework achieved a character error rate (CER) of 28.7% by using 11 distributed microphones, while a monaural microphone placed on the center of the table had a CER of 38.2%. We also showed that our framework achieved CER of 21.8%, which is only 2.1 percentage points higher than the CER in headset microphone-based transcription.
@inproceedings{horiguchi2020utterancewise, title = {Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Horiguchi, Shota and Fujita, Yusuke and Nagamatsu, Kenji}, pages = {344--348}, year = {2020}, month = oct, }
INTERSPEECH
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Kenji Nagamatsu

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Oct 2020

Abs arXiv Bib HTML Code

End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. In unknown numbers of speakers conditions, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER.
@inproceedings{horiguchi2020endtoend, title = {End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Horiguchi, Shota and Fujita, Yusuke and Watanabe, Shinji and Xue, Yawen and Nagamatsu, Kenji}, pages = {269--273}, year = {2020}, month = oct, }
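The step in the abstract above where attractors are multiplied by the embedding sequence can be sketched as follows. This is a minimal illustration: the attractor generation itself (the encoder-decoder part) is omitted, the sigmoid that turns dot products into activity posteriors is an assumption not stated in the abstract, and the name `speaker_activities` is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def speaker_activities(embeddings, attractors):
    """Per-frame activity posterior for each attractor (one per speaker).

    embeddings: [num_frames][dim] frame-wise speech embeddings.
    attractors: [num_speakers][dim] speaker-wise attractors.
    Returns [num_frames][num_speakers] values in (0, 1).
    """
    return [
        [
            sigmoid(sum(e * a for e, a in zip(frame, attractor)))
            for attractor in attractors
        ]
        for frame in embeddings
    ]
```

Because the output has one column per attractor, the number of estimated speakers follows directly from how many attractors were generated.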
ICRA
Anticipating the Start of User Interaction for Service Robot in the Wild

Koichiro Ito, Quan Kong, Shota Horiguchi, Takashi Sumiyoshi, and Kenji Nagamatsu

In IEEE International Conference on Robotics and Automation (ICRA), Jun 2020

Abs Bib HTML

A service robot is expected to provide proactive service for visitors who require its help. In contrast to passive service, e.g., providing service only after being spoken to, proactive service initiates an interaction at an early stage, e.g., talking to potential visitors who need the robot’s help in advance. This paper addresses how to anticipate the start of user interaction. We propose an approach using only a single RGB camera that anticipates whether a visitor will come to the robot for interaction or just pass it by. In the proposed approach, we (i) utilize the visitor’s pose information from captured images incorporating facial information, (ii) train a CNN-LSTM-based model in an end-to-end manner with an exponential loss for early anticipation, and (iii) during the training, the network branch for facial keypoints acquired as part of the human pose information is taught to mimic the branch trained with the face image from a specialized face detector with human verification. By virtue of (iii), at the inference, we can run our model in an embedded system processing only the pose information without an additional face detector and typical accuracy drop. We evaluated the proposed approach on our collected real world data with a real service robot and publicly available JPL interaction dataset and found that it achieved accurate anticipation performance.
@inproceedings{ito2020anticipating, title = {Anticipating the Start of User Interaction for Service Robot in the Wild}, author = {Ito, Koichiro and Kong, Quan and Horiguchi, Shota and Sumiyoshi, Takashi and Nagamatsu, Kenji}, booktitle = {IEEE International Conference on Robotics and Automation (ICRA)}, year = {2020}, month = jun, pages = {9687--9693}, }

SemEval
Hitachi at SemEval-2020 Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition

Terufumi Morishita*, Gaku Morio*, Shota Horiguchi, Hiroaki Ozaki, and Toshinori Miyoshi

In The Fourteenth Workshop on Semantic Evaluation (SemEval), Dec 2020

(*) Equal contribution

Abs Bib HTML

Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them. SemEval-2020 task 8, Memotion Analysis, aims at automatically recognizing these emotions of so-called internet memes. In this paper, we propose a simple but effective Modality Ensemble that incorporates visual and textual deep-learning models, which are independently trained, rather than providing a single multi-modal joint network. To this end, we first fine-tune four pre-trained visual models (i.e., Inception-ResNet, PolyNet, SENet, and PNASNet) and four textual models (i.e., BERT, GPT-2, Transformer-XL, and XLNet). Then, we fuse their predictions with ensemble methods to effectively capture cross-modal correlations. The experiments performed on dev-set show that both visual and textual features aided each other, especially in subtask-C, and consequently, our system ranked 2nd on subtask-C.
@inproceedings{morishita2020hitachi, title = {Hitachi at {SemEval-2020} Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition}, author = {Morishita, Terufumi and Morio, Gaku and Horiguchi, Shota and Ozaki, Hiroaki and Miyoshi, Toshinori}, booktitle = {The Fourteenth Workshop on Semantic Evaluation (SemEval)}, year = {2020}, month = dec, pages = {1126--1134}, }
CHiME
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, and Neville Ryant

In The 6th International Workshop on Speech Processing in Everyday Environments (CHiME-2020), May 2020

Abs arXiv Bib HTML

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
@inproceedings{watanabe2020chime6, title = {{CHiME-6} {Challenge}: Tackling Multispeaker Speech Recognition for Unsegmented Recordings}, author = {Watanabe, Shinji and Mandel, Michael and Barker, Jon and Vincent, Emmanuel and Arora, Ashish and Chang, Xuankai and Khudanpur, Sanjeev and Manohar, Vimal and Povey, Daniel and Raj, Desh and Snyder, David and Subramanian, Aswin Shanmugam and Trmal, Jan and Yair, Bar Ben and Boeddeker, Christoph and Ni, Zhaoheng and Fujita, Yusuke and Horiguchi, Shota and Kanda, Naoyuki and Yoshioka, Takuya and Ryant, Neville}, booktitle = {The 6th International Workshop on Speech Processing in Everyday Environments (CHiME-2020)}, year = {2020}, month = may, pages = {1--7}, }

Preprint
Neural Speaker Diarization with Speaker-Wise Chain Rule

Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, and Kenji Nagamatsu

arXiv:2006.01796, Jun 2020

Abs arXiv Bib

Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed number of speaker issue by a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each speaker’s speech activity is regarded as a single random variable, and is estimated sequentially conditioned on previously estimated other speakers’ speech activities. Similar to other sequence-to-sequence models, the proposed method produces a variable number of speakers with a stop sequence condition. We evaluated the proposed method on multi-speaker audio recordings of a variable number of speakers. Experimental results show that the proposed method can correctly produce diarization results with a variable number of speakers and outperforms the state-of-the-art end-to-end speaker diarization methods in terms of diarization error rate.
@misc{fujita2020neural, title = {Neural Speaker Diarization with Speaker-Wise Chain Rule}, author = {Fujita, Yusuke and Watanabe, Shinji and Horiguchi, Shota and Xue, Yawen and Shi, Jing and Nagamatsu, Kenji}, year = {2020}, month = jun, howpublished = {arXiv:2006.01796}, }
Preprint
End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-Label Classification

Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, and Kenji Nagamatsu

arXiv:2003.02966, Feb 2020

Abs arXiv Bib

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose the End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.
@misc{fujita2020endtoend, title = {End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-Label Classification}, author = {Fujita, Yusuke and Watanabe, Shinji and Horiguchi, Shota and Xue, Yawen and Nagamatsu, Kenji}, year = {2020}, month = feb, howpublished = {arXiv:2003.02966}, }

2019

ASRU
Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe

In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019

Abs arXiv Bib HTML

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1 % from that of TS-ASR given oracle speaker embeddings. Furthermore, our method can solve speaker diarization simultaneously as a by-product and achieved better DER than that of the conventional clustering-based speaker diarization method based on i-vector.
@inproceedings{kanda2019simultaneous, title = {Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models}, author = {Kanda, Naoyuki and Horiguchi, Shota and Fujita, Yusuke and Xue, Yawen and Nagamatsu, Kenji and Watanabe, Shinji}, booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, pages = {31--38}, year = {2019}, month = dec, }
ASRU
End-to-End Neural Speaker Diarization with Self-Attention

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe

In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019

Best Paper Finalist

Abs arXiv Bib HTML Code

Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems; i.e., (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, the End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for dealing with the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that the self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method was even better than that of the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that the self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
@inproceedings{fujita2019endtoend, title = {End-to-End Neural Speaker Diarization with Self-Attention}, author = {Fujita, Yusuke and Kanda, Naoyuki and Horiguchi, Shota and Xue, Yawen and Nagamatsu, Kenji and Watanabe, Shinji}, booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)}, pages = {296--303}, year = {2019}, month = dec, }
INTERSPEECH
End-to-End Neural Speaker Diarization with Permutation-Free Objectives

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019

Abs arXiv Bib HTML Code

In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of the benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved diarization error rate of 12.28%, while a conventional clustering-based system produced diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided 25.6% relative improvement on the CALLHOME dataset.
@inproceedings{fujita2019endtoend2, title = {End-to-End Neural Speaker Diarization with Permutation-Free Objectives}, author = {Fujita, Yusuke and Kanda, Naoyuki and Horiguchi, Shota and Nagamatsu, Kenji and Watanabe, Shinji}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, year = {2019}, month = sep, pages = {4300--4304}, }
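The permutation-free objective described in the abstract above can be illustrated with a short sketch: the loss is the binary cross entropy under whichever assignment of output speakers to reference speakers gives the lowest value. This is a simplified stand-alone version under stated assumptions (exhaustive search over permutations, plain Python lists, hypothetical function names), not the authors' implementation.

```python
import itertools
import math


def binary_cross_entropy(pred, label, eps=1e-7):
    """Mean binary cross entropy between two equal-length sequences."""
    total = 0.0
    for p, y in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def permutation_free_loss(preds, labels):
    """Return (minimum loss, best permutation) over all speaker orderings.

    preds:  [num_speakers][num_frames] predicted activity probabilities.
    labels: [num_speakers][num_frames] reference activities (0 or 1).
    best_perm[s] is the reference speaker matched to output speaker s.
    """
    num_speakers = len(labels)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        loss = sum(
            binary_cross_entropy(preds[s], labels[perm[s]])
            for s in range(num_speakers)
        ) / num_speakers
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The exhaustive search grows factorially with the number of speakers, so it is only meant to make the idea concrete for two or three speakers.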
INTERSPEECH
Multimodal Response Obligation Detection with Unsupervised Online Domain Adaptation

Shota Horiguchi, Naoyuki Kanda, and Kenji Nagamatsu

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019

Abs Bib HTML

Response obligation detection, which determines whether a dialogue robot has to respond to a detected utterance, is an important function for intelligent dialogue robots. Some studies have tackled this problem; however, they narrow their applicability by impractical assumptions or use of scenario-specific features. Some attempts have been made to widen the applicability by avoiding the use of text modality, which is said to be highly domain dependent, but it decreases the detection accuracy. In this paper, we propose a novel multimodal response obligation detector, which uses visual, audio, and text information for highly-accurate detection, with its unsupervised online domain adaptation to solve the domain dependency problem. Our domain adaptation consists of the weights adaptation of the logistic regression for every modality and an embedding assignment for new words to cope with the high domain dependency of text modality. Experimental results on the dataset collected at a station and commercial building showed that our method achieved high response obligation detection accuracy and was able to handle domain change automatically.
@inproceedings{horiguchi2019multimodal, title = {Multimodal Response Obligation Detection with Unsupervised Online Domain Adaptation}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, author = {Horiguchi, Shota and Kanda, Naoyuki and Nagamatsu, Kenji}, pages = {4180--4184}, year = {2019}, month = sep, }
INTERSPEECH
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, and Shinji Watanabe

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019

Abs arXiv Bib HTML

In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes target speaker’s utterances from a monaural mixture of multiple speakers speech given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during training. This will regularize the network to achieve a better representation for speaker separation, thus achieving better accuracy on the target-speaker ASR. We evaluated our proposed method using two-speaker-mixed speech in various signal-to-interference-ratio conditions. We first built a strong target-speaker ASR baseline based on the state-of-the-art lattice-free maximum mutual information. This baseline achieved a word error rate (WER) of 18.06% on the test set while a normal ASR trained with clean data produced a completely corrupted result (WER of 84.71%). Then, our proposed loss further reduced the WER by 6.6% relative to this strong baseline, achieving a WER of 16.87%. In addition to the accuracy improvement, we also showed that the auxiliary output branch for the proposed loss can even be used for a secondary ASR for interference speakers’ speech.
@inproceedings{kanda2019auxiliary, title = {Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition}, author = {Kanda, Naoyuki and Horiguchi, Shota and Takashima, Ryoichi and Fujita, Yusuke and Nagamatsu, Kenji and Watanabe, Shinji}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, year = {2019}, month = sep, pages = {236--240}, }
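The relative improvement quoted in the abstract above follows from the two WER figures it reports. A minimal check, with `relative_reduction` as a hypothetical helper name:

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, as a percentage of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# WERs taken from the abstract: 18.06% (baseline) and 16.87% (proposed loss).
print(round(relative_reduction(18.06, 16.87), 1))  # 6.6
```

The absolute difference is 1.19 percentage points, which corresponds to the 6.6% relative reduction stated in the abstract.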
INTERSPEECH
Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party Scenario

Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, and Reinhold Haeb-Umbach

In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019

Abs arXiv Bib HTML

In this paper, we present Hitachi and Paderborn University’s joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As an example of a dinner party scenario, we have chosen the data presented during the CHiME-5 speech recognition challenge, where the baseline ASR had a 73.3% word error rate (WER), and even the best performing system at the CHiME-5 challenge had a 46.1% WER. We extensively investigated a combination of the guided source separation-based speech enhancement technique and an already proposed strong ASR backend and found that a tight combination of these techniques provided substantial accuracy improvements. Our final system achieved WERs of 39.94% and 41.64% for the development and evaluation data, respectively, both of which are the best published results for the dataset. We also investigated with additional training data on the official small data in the CHiME-5 corpus to assess the intrinsic difficulty of this ASR task.
@inproceedings{kanda2019guided, title = {Guided Source Separation Meets a Strong {ASR} Backend: {Hitachi/Paderborn University} Joint Investigation for Dinner Party Scenario}, author = {Kanda, Naoyuki and Boeddeker, Christoph and Heitkaemper, Jens and Fujita, Yusuke and Horiguchi, Shota and Nagamatsu, Kenji and Haeb-Umbach, Reinhold}, booktitle = {The Annual Conference of the International Speech Communication Association (INTERSPEECH)}, year = {2019}, month = sep, pages = {1248--1252}, }
ICASSP
Acoustic Modeling for Distant Multi-Talker Speech Recognition with Single- and Multi-Channel Branches

Naoyuki Kanda, Yusuke Fujita, Shota Horiguchi, Rintaro Ikeshita, Kenji Nagamatsu, and Shinji Watanabe

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019

Abs Bib HTML

This paper presents a novel heterogeneous-input multi-channel acoustic model (AM) that has both single-channel and multi-channel input branches. In our proposed training pipeline, a single-channel AM is trained first, then a multi-channel AM is trained starting from the single-channel AM with a randomly initialized multi-channel input branch. Our model uniquely uses the power of a complemental speech enhancement (SE) module while exploiting the power of jointly trained AM and SE architecture. Our method was the foundation for the Hitachi/JHU CHiME-5 system that achieved the second-best result in the CHiME-5 competition, and this paper details various investigation results that we were not able to present during the competition period. We also evaluated and reconfirmed our method’s effectiveness with the AMI Meeting Corpus. Our AM achieved a 30.12% word error rate (WER) for the development set and a 32.33% WER for the evaluation set for the AMI Corpus, both of which are the best results ever reported to the best of our knowledge.
@inproceedings{kanda2019acoustic, title = {Acoustic Modeling for Distant Multi-Talker Speech Recognition with Single- and Multi-Channel Branches}, author = {Kanda, Naoyuki and Fujita, Yusuke and Horiguchi, Shota and Ikeshita, Rintaro and Nagamatsu, Kenji and Watanabe, Shinji}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year = {2019}, month = may, pages = {6630--6634}, }
WACV
Omnidirectional Pedestrian Detection by Rotation Invariant Training

Masato Tamura, Shota Horiguchi, and Tomokazu Murakami

In IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2019

Abs Bib HTML Supp

Recently much progress has been made in pedestrian detection by utilizing the learning ability of convolutional neural networks (CNNs). However, due to the lack of omnidirectional images to train CNNs, few CNN-based detectors have been proposed for omnidirectional pedestrian detection. One significant difference between omnidirectional images and perspective images is that the appearance of pedestrians is rotated in omnidirectional images. A previous method has dealt with this by transforming omnidirectional images into perspective images in the test phase. However, this method has significant drawbacks, namely, the computational cost and the performance degradation caused by the transformation. To address this issue, we propose a rotation invariant training method, which only uses randomly rotated perspective images without any additional annotation. By this method, existing large-scale datasets can be utilized. In test phase, omnidirectional images can be used without the transformation. To group predicted bounding boxes, we also develop a bounding box refinement, which works better for our detector than non-maximum suppression. The proposed detector achieved a state-of-the-art performance on four public benchmarks.
@inproceedings{tamura2019omnidirectional, title = {Omnidirectional Pedestrian Detection by Rotation Invariant Training}, booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)}, author = {Tamura, Masato and Horiguchi, Shota and Murakami, Tomokazu}, pages = {1989--1998}, year = {2019}, month = jan, }

2018

TMM
Personalized Classifier for Food Image Recognition

Shota Horiguchi, Sosuke Amano, Makoto Ogawa, and Kiyoharu Aizawa

IEEE Transactions on Multimedia, Oct 2018

Abs arXiv Bib HTML

Currently, food image recognition tasks are evaluated against fixed datasets. However, in real-world conditions, there are cases in which the number of samples in each class continues to increase and samples from novel classes appear. In particular, dynamic datasets in which each individual user creates samples and continues the updating process often has content that varies considerably between different users, and the number of samples per person is very limited. A single classifier common to all users cannot handle such dynamic data. Bridging the gap between the laboratory environment and the real world has not yet been accomplished on a large scale. Personalizing a classifier incrementally for each user is a promising way to do this. In this paper, we address the personalization problem, which involves adapting to the user’s domain incrementally using a very limited number of samples. We propose a simple yet effective personalization framework, which is a combination of the nearest class mean classifier and the 1-nearest neighbor classifier based on deep features. To conduct realistic experiments, we made use of a new dataset of daily food images collected by a food-logging application. Experimental results show that our proposed method significantly outperforms existing methods.
@article{horiguchi2018personalized, title = {Personalized Classifier for Food Image Recognition}, journal = {IEEE Transactions on Multimedia}, author = {Horiguchi, Shota and Amano, Sosuke and Ogawa, Makoto and Aizawa, Kiyoharu}, volume = {20}, number = {10}, year = {2018}, month = oct, pages = {2836--2848}, }

ACMMM
Face-Voice Matching Using Cross-Modal Embeddings

Shota Horiguchi, Naoyuki Kanda, and Kenji Nagamatsu

In ACM International Conference on Multimedia (ACMMM), Oct 2018

Abs Bib HTML

Face-voice matching is a task to find correspondence between faces and voices. Many researches in cognitive science have confirmed human ability in the face-voice matching tasks. Such ability is useful for creating natural human machine interaction systems and in many other applications. In this paper, we propose a face-voice matching model that learns cross-modal embeddings between face images and voice characteristics. We constructed a novel FVCeleb dataset which consists of face images and utterances from 1,078 persons. These persons were selected from the MS-Celeb-1M face image dataset and the VoxCeleb audio dataset. In two-alternative forced-choice matching task with an audio input and two face-image candidates of the same gender, our model achieved 62.2% and 56.5% accuracy on the FVCeleb and the subset of the GRID corpus, respectively. These results are very similar to human performance reported in cognitive science studies.
@inproceedings{horiguchi2018facevoice, title = {Face-Voice Matching Using Cross-Modal Embeddings}, booktitle = {ACM International Conference on Multimedia (ACMMM)}, author = {Horiguchi, Shota and Kanda, Naoyuki and Nagamatsu, Kenji}, pages = {1011--1019}, year = {2018}, month = oct, }

CHiME
The Hitachi/JHU CHiME-5 System: Advances in Speech Recognition for Everyday Home Environments Using Multiple Microphone Arrays

Naoyuki Kanda, Rintaro Ikeshita, Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu, Xiaofei Wang, Vimal Manohar, Nelson Enrique Yalta Soplin, Matthew Maciejewski, Szu-Jui Chen, Aswin Shanmugam Subramanian, Ruizhi Li, Zhiqi Wang, Jason Naradowsky, L. Paola Garcia-Perera, and Gregory Sell

In The 5th International Workshop on Speech Processing in Everyday Environments (CHiME-2018), Sep 2018

Abs Bib HTML Slides

This paper presents Hitachi and JHU’s efforts on developing CHiME-5 system to recognize dinner party speeches recorded by multiple microphone arrays. We newly developed (1) the way to apply multiple data augmentation methods, (2) residual bidirectional long short-term memory, (3) 4-ch acoustic models, (4) multiple-array combination methods, (5) hypothesis deduplication method, and (6) speaker adaptation technique of neural beamformer. As the results, our best system in category B achieved 52.38% of word error rates (WERs) for development set, which corresponded to 35% of relative WER reduction from the state-of-the-art baseline. Our best system also achieved 48.20% of WER for evaluation set, which was the 2nd best result in the CHiME-5 competition.
@inproceedings{kanda2018hitachijhu, title = {The {Hitachi/JHU} {CHiME-5} System: Advances in Speech Recognition for Everyday Home Environments Using Multiple Microphone Arrays}, author = {Kanda, Naoyuki and Ikeshita, Rintaro and Horiguchi, Shota and Fujita, Yusuke and Nagamatsu, Kenji and Wang, Xiaofei and Manohar, Vimal and {Yalta Soplin}, Nelson Enrique and Maciejewski, Matthew and Chen, Szu-Jui and Subramanian, Aswin Shanmugam and Li, Ruizhi and Wang, Zhiqi and Naradowsky, Jason and Garcia-Perera, L. Paola and Sell, Gregory}, booktitle = {The 5th International Workshop on Speech Processing in Everyday Environments (CHiME-2018)}, year = {2018}, month = sep, pages = {6--10}, }

2016

Food Search Based on User Feedback to Assist Image-Based Food Recording Systems

Sosuke Amano, Shota Horiguchi, Kiyoharu Aizawa, Kazuki Maeda, Masanori Kubota, and Makoto Ogawa

In International Workshop On Multimedia Assisted Dietary Management (MADiMa), Oct 2016

Abs Bib HTML

Food diaries or diet journals are thought to be effective for improving the dietary lives of users. One important challenge in this field involves assisting users in recording their daily food intake. In recent years, food image recognition has attracted a considerable amount of research interest as a new technology to help record users’ food intake. However, since there are so many types of food, and it is unrealistic to expect a system to recognize all foods. In this paper, we propose an optimal combination of image recognition and interactive search in order to record users’ intake of food. The image recognition generates a list of candidate names for a given food picture. The user chooses the closest name to the meal, which triggers an associative food search based on food contents, such as ingredients. We show the proposed system is efficient to assist users maintain food journals.
@inproceedings{amano2016food, title = {Food Search Based on User Feedback to Assist Image-Based Food Recording Systems}, author = {Amano, Sosuke and Horiguchi, Shota and Aizawa, Kiyoharu and Maeda, Kazuki and Kubota, Masanori and Ogawa, Makoto}, booktitle = {International Workshop On Multimedia Assisted Dietary Management (MADiMa)}, pages = {71--75}, year = {2016}, month = oct, }
ICIP
The Log-Normal Distribution of the Size of Objects in Daily Meal Images and Its Application to the Efficient Reduction of Object Proposals

Shota Horiguchi, Kiyoharu Aizawa, and Makoto Ogawa

In IEEE International Conference on Image Processing (ICIP), Sep 2016

Abs Bib HTML

In general, object-detection methods apply classifiers to pre-calculated object proposals. It is therefore important to minimize the number of proposals to achieve computational efficiency. In this paper, we show that the region size for food objects in recorded images of daily food follows a lognormal distribution, which is different from the distribution for widely used datasets collected by querying the names of dishes. We explain this characteristic using Gibrat’s law, and construct a model for the region-size distribution of objects in images. We applied the model to the filtering of object proposals generated by selective search and edge boxes. We obtained a significant reduction of 40.6% in the number of hypotheses compared with a conventional selective search, despite a decrease of only 0.007 in the Mean Average Best Overlap.
@inproceedings{horiguchi2016lognormal, title = {The Log-Normal Distribution of the Size of Objects in Daily Meal Images and Its Application to the Efficient Reduction of Object Proposals}, booktitle = {IEEE International Conference on Image Processing (ICIP)}, author = {Horiguchi, Shota and Aizawa, Kiyoharu and Ogawa, Makoto}, pages = {3668--3672}, year = {2016}, month = sep, }