Shota Horiguchi
horiguchi [at] ieee.org
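I am a Research Specialist at NTT Human Informatics Laboratories, where I work on research and development of speech-related technologies.
From 2017 to 2024, I was with the Research & Development Group at Hitachi, Ltd.
I received my Ph.D. from the University of Tsukuba in 2023. During my doctoral studies, I was a member of the Multimedia Laboratory, supervised by Associate Professor Takeshi Yamada.
Until March 2017, I conducted computer vision research in the Aizawa-Yamasaki Laboratory (now the Aizawa-Yamakata-Matsui Laboratory) at the University of Tokyo, where I received my bachelor's and master's degrees under the supervision of Professor Kiyoharu Aizawa.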
News
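| Date | News |
|---|---|
| Aug 31, 2024 | One first-author paper and one co-authored paper were accepted to SLT 2024. |
| Jun 8, 2024 | Two co-authored papers were accepted to INTERSPEECH 2024. |
| Feb 1, 2024 | I left Hitachi, Ltd. and joined Nippon Telegraph and Telephone Corporation (NTT). I will work on speech-related technologies as a Research Specialist at NTT Human Informatics Laboratories. |
| Dec 13, 2023 | The paper "Streaming Active Learning for Regression Problems Using Regression via Classification" was accepted to ICASSP 2024. |
| Sep 14, 2023 | My doctoral dissertation was made publicly available. |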
Selected Publications
- TASLP: Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors. Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, and Yohei Kawaguchi. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Jan 2023.
A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally the results of the blocks are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
- TASLP: Encoder-Decoder Based Attractors for End-to-End Neural Diarization. Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Paola García. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Mar 2022. 🏆 Itakura Prize Innovative Young Researcher Award
This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against cascaded approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional cascaded approach.
- TPAMI: Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features. Shota Horiguchi, Daiki Ikami, and Kiyoharu Aizawa. IEEE Transactions on Pattern Analysis and Machine Intelligence, May 2020.
End-to-end distance metric learning (DML) has been applied to obtain features useful in many computer vision tasks. However, these DML studies have not provided equitable comparisons between features extracted from DML-based networks and softmax-based networks. In this paper, we present objective comparisons between these two approaches under the same network architecture.
- TMM: Personalized Classifier for Food Image Recognition. Shota Horiguchi, Sosuke Amano, Makoto Ogawa, and Kiyoharu Aizawa. IEEE Transactions on Multimedia, Oct 2018.
Currently, food image recognition tasks are evaluated against fixed datasets. However, in real-world conditions, there are cases in which the number of samples in each class continues to increase and samples from novel classes appear. In particular, dynamic datasets in which each individual user creates samples and continues the updating process often have content that varies considerably between different users, and the number of samples per person is very limited. A single classifier common to all users cannot handle such dynamic data. Bridging the gap between the laboratory environment and the real world has not yet been accomplished on a large scale. Personalizing a classifier incrementally for each user is a promising way to do this. In this paper, we address the personalization problem, which involves adapting to the user’s domain incrementally using a very limited number of samples. We propose a simple yet effective personalization framework, which is a combination of the nearest class mean classifier and the 1-nearest neighbor classifier based on deep features. To conduct realistic experiments, we made use of a new dataset of daily food images collected by a food-logging application. Experimental results show that our proposed method significantly outperforms existing methods.
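As a rough illustration of the idea in the abstract above, the following Python sketch combines a nearest class mean (NCM) classifier built from shared data with a 1-nearest neighbor (1-NN) lookup over samples added by a single user. It is not the implementation from the paper: the Euclidean distance, the rule of taking the single closest candidate across both sets, the class name `PersonalizedClassifier`, and the toy 2-D vectors standing in for deep features are all assumptions made for illustration.

```python
from math import dist


def class_means(features, labels):
    """Average the feature vectors of each class to get NCM prototypes."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


class PersonalizedClassifier:
    """Hypothetical NCM (shared) + 1-NN (per-user) classifier, for illustration only."""

    def __init__(self, shared_features, shared_labels):
        # Class means computed once from data common to all users.
        self.means = class_means(shared_features, shared_labels)
        # (feature, label) pairs added incrementally by one user.
        self.user_samples = []

    def add_user_sample(self, feature, label):
        """Store one user sample; this may introduce a class unseen in the shared data."""
        self.user_samples.append((list(feature), label))

    def predict(self, feature):
        """Return the label of the closest candidate among class means and user samples."""
        candidates = [(dist(feature, mean), label) for label, mean in self.means.items()]
        candidates += [(dist(feature, vec), label) for vec, label in self.user_samples]
        return min(candidates, key=lambda c: c[0])[1]


# Toy usage with made-up 2-D "features".
clf = PersonalizedClassifier(
    [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0]],
    ["rice", "rice", "curry"],
)
before = clf.predict([0.1, 0.9])        # only shared class means are available
clf.add_user_sample([0.0, 1.0], "ramen")
after = clf.predict([0.1, 0.9])         # the user's own sample is now the closest
print(before, after)
```

Because user samples are simply appended, adapting to a new user or a new class requires no retraining in this sketch, which is the property the abstract highlights for settings with very few samples per person.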