Voice Localization Using Nearby Wall Reflections


Sheng Shen, Daguan Chen, Yu-Lin Wei, Zhijian Yang, Romit Roy Choudhury
University of Illinois at Urbana-Champaign
sshen19@illinois.edu, daguanc2@illinois.edu, yulinlw2@illinois.edu, zhijian7@illinois.edu, croy@illinois.edu

ABSTRACT
Voice assistants such as Amazon Echo (Alexa) and Google Home use microphone arrays to estimate the angle of arrival (AoA) of the human voice. This paper focuses on adding user localization as a new capability to voice assistants. For any voice command, we want Alexa to be able to localize the user inside the home. The core challenge is two-fold: (1) accurately estimating the AoAs of multipath echoes without knowledge of the source signal, and (2) tracing these AoAs back to reverse triangulate the user’s location.

We develop VoLoc, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection. The AoAs and geometric parameters of the nearby wall are then fused to reveal the user’s location. Under modest assumptions, we report localization accuracy of 0.44 m across different rooms, clutter, and user/microphone locations. VoLoc runs in near real-time but needs to hear around 15 voice commands before becoming operational.

CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing; • Hardware → Signal processing systems; • Information systems → Location based services.

KEYWORDS
Amazon Alexa, smart home, voice assistant, source localization, microphone array, acoustic reverberation, angle-of-arrival, edge computing

ACM Reference Format:
Sheng Shen, Daguan Chen, Yu-Lin Wei, Zhijian Yang, and Romit Roy Choudhury. 2020. Voice Localization Using Nearby Wall Reflections. In The 26th Annual International Conference on Mobile Computing and Networking (MobiCom ’20), September 21–25, 2020, London, United Kingdom. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3372224.3380884

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MobiCom ’20, September 21–25, 2020, London, United Kingdom
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7085-1/20/09. . .

1 INTRODUCTION
Voice assistants such as Amazon Echo and Google Home continue to gain popularity, with new “skills” continuously being added to them [6, 9, 22, 38, 60, 69]. One skill coming to Alexa is the ability to infer emotion and age group from the user’s voice commands [6, 9, 38]. More such skills are expected to roll out, aimed at improving the contextual background of the human’s voice command. For instance, knowing a user’s age may help in retrieving information from the web and in personalizing human-machine conversations.

Towards enriching multiple dimensions of context-awareness, companies like Amazon, Google, and Samsung are also pursuing the problem of user localization [8, 12, 35, 43]. Location adds valuable context to the user’s commands, allowing Alexa to resolve ambiguities. For instance: (1) Knowing the user’s location could help determine which light the user is referring to when she says “turn on the light”. Naming and remembering every light (or fan, thermostat, TV, and other IoT device) is known to become a memory overload for users [2, 5]. Similarly, Alexa could help save energy in smart buildings if it understands where the user is when she says “increase the temperature”. (2) More broadly, location could aid speech recognition by narrowing down the set of possible commands [4, 46, 62]. If Alexa localizes Jane to the laundry machine, then a poorly decoded command like “add urgent to groceries” could be correctly resolved to “detergent”. In fact, Google is working on “generating kitchen-specific speech recognition models” when its voice assistant detects “utterances made in or near kitchens” from the user [31, 75]. (3) Lastly, localizing sounds other than voice (say, footsteps or running water) could further enrich context-awareness [7]. Alexa could perhaps remind an independently living grandmother to take her medicine when she is walking by the medicine cabinet, or nudge a child when he runs out of the washroom without turning off the faucet.

These and other uses of location will emerge over time, and the corresponding privacy implications will also need attention. In this paper, however, we focus on exploring the technical viability of the problem. To this end, let us begin by intuitively understanding the general problem space, followed by the underlying challenges and opportunities.

The central challenge in voice-source localization is that an unknown source signal must be localized from a single (and small) microphone array. Relaxing either of these two requirements opens up rich bodies of past work [24, 28, 37, 63, 66, 78, 79]. For instance, a known source signal (such as a training sequence or an impulse sound) can be localized through channel estimation and fingerprinting [29, 59, 66, 79], while scattered microphone arrays permit triangulation [24, 25, 37, 63]. However, VoLoc’s aim to localize arbitrary sound signals with a single device essentially inherits the worst of both worlds.

In surveying the space of solutions, we observe the following: (1) Signal-strength-based approaches that estimate some form of distance are fragile due to indoor multipath. Amplitude variations across microphones are also small because the microphone array is small. (2) Machine learning approaches that jointly infer the in-room reflections and per-user voice models seem extremely difficult, even if possible. Moreover, such training would impose a prohibitive burden on users, making the system unusable. (3) Perhaps a more viable idea is to leverage the rich body of work on angle of arrival (AoA). Briefly, AoA is the angular direction from which a signal arrives at a receiver. Voice assistants today already estimate the direct path’s AoA and beamform towards the user [1, 3, 23, 36]. So one possibility is to detect additional AoAs for the multipath echoes and trace the AoA directions back to their point of intersection (via reverse triangulation).

While the idea of tracing back indoor multipath echoes (such as those from walls and ceilings) for reverse triangulation is certainly not new [16], extracting the AoAs of individual echoes, especially indoors, is unfortunately difficult even for today’s state-of-the-art algorithms [41, 68]. Even the direct-path AoA is often erroneous or biased in today’s systems, and small AoA offsets magnify localization error. Finally, tracing back the AoAs requires knowledge of the reflectors in the room, which is a somewhat impractical proposition.
This is why existing work that leverages multipath reverse triangulation has assumed empty rooms, known sounds, and even near-field effects [16, 28, 59].

While the problem is non-trivial, application-specific opportunities exist:
• Perhaps not all AoAs are necessary. Even two AoAs may suffice for reverse triangulation, so long as these AoAs are estimated with high accuracy. Of course, the reflector for the second AoA still needs to be known.
• To connect to power outlets, Alexa is typically near a wall. If the AoA from the wall can be reliably discriminated from other echoes, and the wall’s distance and orientation can be estimated from voice signals, then reverse triangulation may be feasible.
• Finally, the user’s height can serve as an invariant, constraining the 3D location search space.

All in all, these opportunities may give us adequate ammunition to approach the problem. The core algorithmic questions therefore boil down to accurate AoA detection and joint wall-geometry estimation. These two modules form the technical crux of VoLoc, and we discuss our core intuitions next.

Accurate AoAs: Accurate AoA estimation is difficult in multipath settings because each AoA needs to be extracted from a mixture of AoAs caused by echoes. Existing algorithms try to align (beamform) towards different directions to find the energy maxima. However, they do not perform well because all the echoes are strongly correlated (elaborated in Section 2). We aim to break away from this approach. Our central idea is rooted in leveraging (1) the slow velocity of sound, and (2) pauses (or short silences) in acoustic signals. A voice command, for example, is preceded by silence. The ends of these silences are unique opportunities to observe the cascade of arriving signals, starting with the clean direct path, followed by the first echo, the second echo, and so on.
This means that the direct-path signal is essentially clean for a short time window, which presents an opportunity to accurately derive its AoA. Since the first echo is a delayed version of the direct path, this echo can be modeled and cancelled with appropriate alignment. This process can continue iteratively, and in principle, all AoAs and delays can be jointly extracted. In practice, hardware noise becomes the limiting factor, so cancellation errors accrue over time. Thus, VoLoc extracts accurate AoAs and delays for only the initial echoes and uses them for source localization.

Wall Geometry Estimation: Inferring source location from AoA requires geometric knowledge of signal reflectors. To cope with this requirement, existing work has assumed empty rooms with no furniture, and has used non-acoustic sensors (such as cameras or depth sensors) to scan the walls and ceilings of the room [16]. Our opportunity arises from the fact that the wall near Alexa produces a stable echo, i.e., one that is always present. If the wall’s distance and orientation can be estimated with respect to Alexa, then the echo’s AoA and delay become a function of the user location. This also helps in discriminating the wall echo from other echoes, say from objects on the table around Alexa. The algorithmic challenge lies in estimating the wall’s ⟨distance, orientation⟩ tuple from the same voice signals.

We address this problem by gathering signals from recent voice commands and asking the following question: at what distance 𝑑 and orientation 𝜃 must a reflector be such that its echo arrives early and is frequently present in voice command signals? We formulate this as an optimization problem with the error function modeled in terms of ⟨𝑑, 𝜃⟩. This error is summed across multiple recent voice commands, and the minimum error yields the ⟨𝑑, 𝜃⟩ estimates.
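The wall-geometry step can be made concrete with a minimal 2D sketch. It rests on assumptions of our own and is not VoLoc’s implementation or its exact error function:
• The array is a point at the origin.
• Each recent command has already yielded a direct-path AoA, a wall-echo AoA, and the echo’s extra path length (its delay multiplied by the speed of sound), all noise-free.
• The wall is the line whose unit normal points at angle 𝜃 and lies at distance 𝑑, and the echo is treated as coming from the user’s mirror image across that line.
• The error for a candidate ⟨𝑑, 𝜃⟩ is the squared mismatch between predicted and observed extra path length, summed over commands.

The helper names (`simulate`, `triangulate`, `wall_error`, `estimate_wall`) are hypothetical. The 5 to 60 cm search range is an illustrative choice, and the height fusion VoLoc uses is omitted.

```python
import numpy as np


def unit(angle):
    """Unit vector pointing at `angle` radians."""
    return np.array([np.cos(angle), np.sin(angle)])


def cross2(a, b):
    """z-component of the 2D cross product a x b."""
    return a[0] * b[1] - a[1] * b[0]


def mirror(p, d, theta):
    """Mirror image of point p across the wall {q : q . n = d}, n = unit(theta)."""
    n = unit(theta)
    return p + 2.0 * (d - p @ n) * n


def simulate(user, d, theta):
    """Noise-free (direct AoA, echo AoA, extra path length in m) seen at the origin."""
    image = mirror(user, d, theta)
    return (np.arctan2(user[1], user[0]),
            np.arctan2(image[1], image[0]),
            np.linalg.norm(image) - np.linalg.norm(user))


def triangulate(a1, a2, d, theta):
    """Reverse triangulation: the point on the direct ray whose mirror image lies on the echo ray."""
    n, e1, e2 = unit(theta), unit(a1), unit(a2)
    e1m = e1 - 2.0 * (e1 @ n) * n           # direct-path direction mirrored by the wall
    denom = cross2(e1m, e2)
    if abs(denom) < 1e-9:                   # rays (nearly) parallel: no intersection
        return None
    r1 = -2.0 * d * cross2(n, e2) / denom   # distance along the direct-path ray
    user = r1 * e1
    # Reject impossible answers: behind the array, beyond the wall,
    # or with the image on the opposite side of the echo direction.
    if r1 <= 0 or user @ n >= d or mirror(user, d, theta) @ e2 <= 0:
        return None
    return user


def wall_error(commands, d, theta):
    """Sum over commands of squared (predicted - observed) extra path length."""
    total = 0.0
    for a1, a2, extra in commands:
        user = triangulate(a1, a2, d, theta)
        if user is None:
            return np.inf
        predicted = np.linalg.norm(mirror(user, d, theta)) - np.linalg.norm(user)
        total += (predicted - extra) ** 2
    return total


def estimate_wall(commands, d_range_cm=range(5, 61), theta_range_deg=range(360)):
    """Brute-force grid search for the (d, theta) that minimizes wall_error."""
    best = (np.inf, None, None)
    for d_cm in d_range_cm:
        for deg in theta_range_deg:
            err = wall_error(commands, d_cm / 100, np.radians(deg))
            if err < best[0]:
                best = (err, d_cm / 100, deg)
    return best[1], best[2]
```

For example, `estimate_wall([simulate(u, 0.3, np.radians(90)) for u in users])` should return `(0.3, 90)` for a handful of users on the room side of the wall. Once the wall is known, calling `triangulate` with a new command’s two AoAs gives that user’s 2D position. The brute-force grid is used only for readability. Real AoAs and delays are noisy, so the minimum would be approximate rather than exactly zero.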
We over-determine the system by fusing the AoAs, ⟨𝑑, 𝜃⟩, and the user height ℎ, and converge to the user’s indoor 2D location.

We implement VoLoc on an off-the-shelf hardware platform composed of a 6-microphone array arranged in a circle, like Amazon Echo (Figure 1). This was necessary to gain access to raw acoustic signals, since commercially available Echo or Google platforms do not export the raw data. Our microphone array forwards the signal to a Raspberry Pi, which performs basic signal processing and writes the data to a flash card. The data is then transmitted to our laptop over a WiFi Direct interface. Our experiments cover AoA and location estimation in various environments, including student apartments, a house kitchen, and conference rooms.

Figure 1: Seeed Studio off-the-shelf 6-microphone array, sitting on top of a Raspberry Pi.

Our results reveal a median localization accuracy of 0.44 m across a wide range of environments, including ones with objects scattered around the microphone. In achieving this accuracy, the detected AoAs consistently outperform the GCC-PHAT and MUSIC algorithms. VoLoc also estimates wall geometry (distance and orientation) with average accuracies of 1.2 cm and 1.4°, respectively. The results are robust across rooms, users, and microphone positions.

Current Limitations: We believe that blind voice-source localization remains a challenging problem in practice. VoLoc addresses it under four geometric assumptions: (1) The user’s height is known. (2) The line-of-sight (LoS) path exists, meaning that obstructions between the user and the voice assistant do not completely block the signal. (3) The stable reflector is not too far away, so that its reflection is among the first few echoes. (4) The user is within 4–5 m of the device; otherwise, slight AoA errors translate to large triangulation error. While future work would need to relax these assumptions, we believe the core AoA algorithm and the wall-geometry estimation in this paper offer an important step forward. To this end, our contributions may be summarized as:
• A novel iterative align-and-cancel algorithm that jointly extracts initial AoAs and delays from sound pauses. The technique is generalizable to other applications.
• An error-minimization formulation that jointly estimates the geometry of a nearby reflector using only recent voice signals.
• A computationally efficient fusion of AoA, wall reflection, and height to infer indoor 2D human locations.

In the following, we expand on each of these contributions, starting with background on AoA.

2 BACKGROUND AND FORMULATION
This section presents relevant background for this paper, centered around array processing, angle of arrival (AoA), and triangulation. The background will lead into the technical problems and assumptions in VoLoc.

2.1 Array Processing and AoA
Figure 2 shows a simple 3-element linear microphone array with separation 𝑑. Assuming no multipath, the source signal 𝑠(𝑡) arrives at the microphones as 𝑥1(𝑡), 𝑥2(𝑡), and 𝑥3(𝑡), after traveling distances 𝐷1, 𝐷2, and 𝐷3, respectively. Usually {𝐷1, 𝐷2, 𝐷3} ≫ 𝑑, so these sound waves arrive almost in parallel (known as the far-field scenario). From geometry, if the signal’s incoming angle is 𝜃, then the signal wave needs to travel an extra distance of Δ𝑑 = 𝑑 cos(𝜃) to arrive at microphone 𝑀2 compared to 𝑀1, and an extra 2Δ𝑑 to arrive at 𝑀3 compared to 𝑀1.

Figure 2: A simple 3-element microphone array.

When the additional travel distance is converted to phase, the phase difference between 𝑥2(𝑡) and 𝑥1(𝑡) is Δ𝜙 = 2𝜋𝑑 cos(𝜃)/𝜆, and the phase difference between 𝑥3(𝑡) and 𝑥1(𝑡) is 2Δ𝜙. On the other hand, the amplitudes of 𝑥1(𝑡), 𝑥2(𝑡), and 𝑥3(𝑡) will be almost the same, because the differences among 𝐷1, 𝐷2, and 𝐷3 are very small.¹ Thus, in general, the received signal vector can be represented as:

$$
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} x_1 e^{j0} \\ x_1 e^{j\Delta\phi} \\ \vdots \\ x_1 e^{j(n-1)\Delta\phi} \end{bmatrix}
=
\begin{bmatrix} e^{j0} \\ e^{j\Delta\phi} \\ \vdots \\ e^{j(n-1)\Delta\phi} \end{bmatrix} x_1
$$

AoA Estimation without Multipath: In reality, we do not know the signal’s incoming angle 𝜃, so we perform AoA estimation. One solution is to consider every possible 𝜃, compute the corresponding Δ𝜙, apply the appropriate negative phase shift to each microphone, and add the signals up to see the total signal energy. The correct angle 𝜃 should yield the maximum energy because the signals will be perfectly aligned, while other angles yield relatively weak energy. This AoA technique essentially steers the array towards different directions of arrival, computes an AoA energy spectrum, and searches for the maximum peak. For a single sound source with no multipath, this reports the correct AoA direction.

Impact of Multipath Echoes: Now consider the source signal 𝑠(𝑡) reflecting off different surfaces and arriving with different delays from different directions. Each arriving direction corresponds to a specific value 𝜃𝑖, which translates to a corresponding phase difference Δ𝜙𝑖. Thus the received signal at each microphone (with respect to microphone 𝑀1) is a sum of the same source signal, delayed by different phases.
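The steering search described under AoA Estimation without Multipath can be sketched in a few lines. This is a minimal narrowband illustration under our own assumptions, not the paper’s code:
• a 3-element uniform linear array with 4 cm spacing;
• a single 1 kHz source with no echoes;
• a speed of sound of 343 m/s;
• the phase convention 𝑥𝑚 = 𝑥1 e^{j(𝑚−1)Δ𝜙} from the vector above.

`steering_vector` and `aoa_spectrum` are illustrative names, not an existing library API.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def steering_vector(theta_deg, n_mics, spacing, freq):
    """[e^{j0}, e^{jΔφ}, ..., e^{j(n-1)Δφ}] with Δφ = 2π·spacing·cos(θ)/λ."""
    wavelength = SPEED_OF_SOUND / freq
    dphi = 2 * np.pi * spacing * np.cos(np.radians(theta_deg)) / wavelength
    return np.exp(1j * dphi * np.arange(n_mics))


def aoa_spectrum(x, spacing, freq, angles_deg):
    """Delay-and-sum energy per candidate angle; x has shape (n_mics, n_samples)."""
    n_mics = x.shape[0]
    energies = []
    for theta in angles_deg:
        w = steering_vector(theta, n_mics, spacing, freq)
        aligned = np.conj(w) @ x          # apply negative phase shifts, then sum the mics
        energies.append(np.sum(np.abs(aligned) ** 2))
    return np.array(energies)
```

For example, with `x = np.outer(steering_vector(60, 3, 0.04, 1000.0), s)` for some complex envelope `s`, the spectrum over `np.arange(0, 181)` peaks at 60°. At that angle the three phases align exactly, so the energy is 9 times the energy of `s`. The spacing is kept below half a wavelength, so the search over 0° to 180° has no grating-lobe ambiguity. The same function can be fed a mixture of several strongly correlated paths (the 𝑘-echo model) to explore how the spectrum degrades.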
With 𝑘 echoes, we can represent the received signal as:

$$
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix}
e^{j0} & e^{j0} & \cdots & e^{j0} \\
e^{j\Delta\phi_1} & e^{j\Delta\phi_2} & \cdots & e^{j\Delta\phi_k} \\
\vdots & \vdots & \ddots & \vdots \\
e^{j(n-1)\Delta\phi_1} & e^{j(n-1)\Delta\phi_2} & \cdots & e^{j(n-1)\Delta\phi_k}
\end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_k \end{bmatrix}
$$

Estimating AoA under Multipath: The earlier AoA technique (searching and aligning across all possible 𝜃𝑖) is no longer accurate. Phase compensation for an incorrect 𝜃𝑖 may also exhibit strong energy in the AoA spectrum, because the many paths are strongly correlated. Said differently, searching on 𝜃𝑖 is fundamentally a cross-correlation technique that degrades with lower SNR. Since any path’s SNR drops as the number of echoes grows, AoA estimation becomes unreliable.

¹ Sound amplitude attenuates with 1/𝑟, where 𝑟 is the distance traveled. For two paths of lengths 𝑟 and 𝑟 + Δ𝑑, the relative amplitude difference is [1/𝑟 − 1/(𝑟 + Δ𝑑)]/(1/𝑟) ≈ Δ𝑑/𝑟. If Δ𝑑 = 2 cm and 𝑟 = 2 m, there would be a 1% amplitude difference.

Many AoA variants have been proposed [18, 20, 27, 39, 41, 61, 64, 67, 74, 76, 77], but most still rely on cross-correlation. The most popular is perhaps GCC-PHAT [18, 20, 32, 39, 76], which compensates for amplitude variations across frequencies by whitening the signal. The improvement is distinct but does not solve the root problem of inaccurate alignment. Subspace-based algorithms (like MUSIC, ESPRIT, and their variants [61, 64, 67, 72, 74]) are also used, but they assume that signal paths are uncorrelated or can be fully decorrelated. Multipath echoes exhibit strong correlation, so AoA estimation remains a difficult problem.

2.2 Reverse Triangulation
Even if AoAs are estimated correctly, reverse triangulation requires knowledge of the reflectors in the environment (Figure 3). Some past work has scanned the environment with depth cameras to create 3D room models [16], but this approach is largely impractical for real-world users.

Figure 3: Reverse triangulation requires the location of all reflector surfaces, making it impractical.

In principle, however, not all echoes are necessary for tracing back to the source location. The direct path’s AoA and one other AoA would be adequate: say, AoA(1) and AoA(2) in Figure 3. Of course, the location and orientation of AoA(2)’s reflector still need to be inferred. The authors of [28, 29] have attempted a related problem: they infer the shape of an empty room. However, they use precisely designed wideband signals and scattered microphone arrays, and they essentially solve compute-intensive inverse problems [28, 29]. VoLoc takes on the simpler task of estimating one reflector’s position, but under the more challenging constraints of unknown voice signals and near real-time deadlines (a few seconds²).

2.3 Problem Statement and Assumptions
With this background, the problem in this paper can be stated as follows. Using a 6-microphone array, without any knowledge of the source signal, and under the f

VoLoc needs to solve the following three sub-problems:
• Precisely estimate AoA for two signal paths in multipath-rich environments.
• Estimate the distance and orientation of at least one reflector, and identify the corresponding AoA for reverse triangulation.
• Fuse the AoAs, reflector, and height to geometrically infer the user’s indoor location.
The solution must work without any voice training, must complete on the order of seconds, and must handle clutter in the environment (such as various objects scattered on the same table as Alexa).
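The first sub-problem is where the iterative align-and-cancel idea from the introduction applies. It can be illustrated with a deliberately small toy. This is a sketch under our own simplifying assumptions, not VoLoc’s algorithm:
• two microphones, integer-sample delays, and no noise;
• only the direct path and the first echo (a single iteration);
• a positive echo gain;
• a direct-path inter-mic delay between 0 and `max_delay` samples;
• a first echo that arrives at least `window` samples after speech onset.

All function names are hypothetical.

```python
import numpy as np


def delayed(sig, k):
    """Delay sig by k >= 0 samples, zero-filling the start (silence before onset)."""
    out = np.zeros_like(sig)
    out[k:] = sig[:len(sig) - k]
    return out


def segment(template, start, length):
    """template[start:start + length] with zeros outside the template.

    Negative indices stand for the silence before the onset; callers keep
    every index below len(template).
    """
    out = np.zeros(length)
    for i in range(length):
        k = start + i
        if 0 <= k < len(template):
            out[i] = template[k]
    return out


def estimate_direct_delay(x0, x1, window, max_delay):
    """Step 1: direct-path inter-mic delay from the clean, echo-free window."""
    ref = x0[:window]
    scores = [ref @ x1[d:d + window] for d in range(max_delay + 1)]
    return int(np.argmax(scores))


def fit_first_echo(x0, y, window, max_delay):
    """Step 3: explain the cancelled residual y with the clean template.

    With the direct path removed, the first echo leaves
    a * (s(t - tau) - s(t - tau - delta)), where delta is the echo's
    inter-mic delay minus the direct path's. Returns (tau, delta, a).
    """
    template = x0[:window]            # s(t) for 0 <= t < window
    length = window - max_delay       # keeps every template index below window
    best = (0.0, None, None, None)
    for delta in range(-max_delay, max_delay + 1):
        model = segment(template, 0, length) - segment(template, -delta, length)
        m_energy = model @ model
        if m_energy == 0:             # delta == 0: echo shares the direct path's AoA
            continue
        for tau in range(window, len(y) - length + 1):
            y_win = y[tau:tau + length]
            y_energy = y_win @ y_win
            proj = model @ y_win
            # A positive gain rules out the mirror solution (tau + delta, -delta).
            if y_energy < 1e-12 or proj <= 0:
                continue
            score = proj ** 2 / (m_energy * y_energy)  # squared normalized correlation
            if score > best[0]:
                best = (score, tau, delta, proj / m_energy)
    return best[1:]


def align_and_cancel(x0, x1, window, max_delay):
    """One iteration: direct-path delay, cancellation, then first-echo fit."""
    d1 = estimate_direct_delay(x0, x1, window, max_delay)
    y = x0[:len(x1) - d1] - x1[d1:]   # Step 2: align on the direct path and subtract
    tau, delta, gain = fit_first_echo(x0, y, window, max_delay)
    return d1, tau, d1 + delta, gain
```

The toy proceeds in three steps:
• Step 1 uses only the clean window right after the silence.
• Step 2 aligns the second microphone on the direct path and subtracts. In this noise-free toy, that removes the direct path entirely.
• Step 3 fits the remaining echo using the clean template.

For the linear array of Section 2.1, an inter-mic delay of δ samples at sampling rate 𝑓𝑠 would correspond to cos 𝜃 = 𝑐δ/(𝑓𝑠𝑑). With real noise, each subtraction leaves a residue, which is consistent with the observation above that cancellation errors accrue over later echoes.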
