“Can you please …?”

Speech intel­li­gi­bil­ity and room acoustics of a bar approached by an impulse response measurement

This experiment was done as a student’s submission for the class Signals and Systems 2 at the Royal Conservatory, The Hague, in February 2012. It was carried out by Kathrin Grenzdörffer, MA 1st year.

goal

There is a bar where the staff sometimes complain that it is difficult to communicate with each other or to understand customers' orders. This is said to occur especially when more than 20 people are in the bar and the music is at a low level. The room then seems to be filled with constant mumbling and chatter: the guests' speech sounds add up, yet intelligible communication is difficult. To mask this effect, the music is in most cases turned up. Acoustically speaking, this sometimes leads to a vicious circle, as guests start to talk louder, the music is adjusted again, and so on. In addition, when the music is turned up rather loud, the room starts to resonate and low frequencies become blurry.

To indicate what exactly may cause these impressions, an acoustic measurement was conducted in the room itself. For this purpose, an impulse response was measured using forward and reversed sweeps. Salient reflections are related back to actual features of the space. Speech intelligibility is considered a desirable feature of the space, as customers as well as staff prefer to listen and speak effortlessly.

hypoth­e­sis:
Due to a long reverberation time for frequencies below 200 Hz, the room a) has a low-pass filter characteristic and b) suffers from a lack of speech intelligibility.

In other words, the find­ings will have to show sig­nif­i­cant reflec­tions in the lower part of the spec­trum. For the range essen­tial to speech trans­mis­sion (200 — 3600 Hz)* the find­ings will have to reveal a sig­nif­i­cantly irreg­u­lar ampli­tude response.

The room is assumed to be a lin­ear system.

*The fre­quency range used in tele­phone lines sug­gests that human speech is suf­fi­ciently com­pre­hen­si­ble when trans­mit­ted between 200 and 3600 Hz. In the fol­low­ing sec­tions I will con­sider this range to be essential.

setup

hard­ware
- microphone: Shure SM 58
- passive loudspeaker*: self-made by EWP Koninklijk Conservatorium
- amplifier: AKAI AM-U110
- interface: Focusrite Saffire Pro 8
- computer: MacBook Intel Core 2 Duo

* Specifications were not available. Single membrane of diameter 17 cm, height 35 cm, diagonal top 32 cm, metal grid in front of membrane, body of wood.

soft­ware
- Logic Pro 9 to record and save in WAV format
- Praat version 5.3.03 to generate sweeps and inverted sweeps, to perform convolution, and to do spectral analyses

sweeps
The sweep in Praat was gen­er­ated using the fol­low­ing script pro­vided by Peter Pabon (2012).

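# Exponential sine sweep from startFreq to endFreq (in Hz) lasting periodT seconds
startFreq=10
endFreq=12800
lnRange=ln(endFreq/startFreq)
periodT=10
multFact=lnRange/periodT
# The exponential envelope raises the amplitude along with the frequency
Create Sound... ExpSweep 0 periodT 44100 exp(multFact*0.5*(x-periodT))*sin(2*pi*startFreq/multFact*exp(multFact*x))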

Con­se­quently, the ini­tial sweep sounds like this:

Exp­Sweep

Convolving the sweep with its time-reversed copy yields a pulse with a flat amplitude spectrum.
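The relation between the sweep and its reversed copy can be checked numerically. The following Python sketch rebuilds the sweep from the formula in the script above and convolves it with its time-reversed copy. NumPy, the function names and the shortened 2 s duration are assumptions made for illustration; the original work was done in Praat with a 10 s sweep.

```python
import numpy as np

def exp_sweep(start_freq=10.0, end_freq=12800.0, period=10.0, rate=44100):
    """Exponential sine sweep, same formula as the Praat script."""
    mult = np.log(end_freq / start_freq) / period
    t = np.arange(int(period * rate)) / rate
    envelope = np.exp(mult * 0.5 * (t - period))
    phase = 2 * np.pi * start_freq / mult * np.exp(mult * t)
    return envelope * np.sin(phase)

def fft_convolve(a, b):
    """Linear convolution computed through zero-padded FFTs."""
    n = len(a) + len(b) - 1
    size = 1 << (n - 1).bit_length()
    spectrum = np.fft.rfft(a, size) * np.fft.rfft(b, size)
    return np.fft.irfft(spectrum, size)[:n]

# Shortened sweep (2 s instead of 10 s) so the sketch runs quickly.
sweep = exp_sweep(period=2.0)

# Convolution with the time-reversed sweep gives the impulse-like signal.
pulse = fft_convolve(sweep, sweep[::-1])
peak_index = int(np.argmax(np.abs(pulse)))
print(peak_index, len(sweep) - 1)
```

The strongest sample of the result lies exactly in the middle of the signal, at index len(sweep) - 1, as expected for a pulse.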


the space

The bar is situated at a T-shaped crossroads with cars, trams and pedestrians passing on two sides.

schematic floor plan and the three loca­tions of measurement

schematic slice of the location


set­ting

To com­pare dif­fer­ent sit­u­a­tions for acoustic com­mu­ni­ca­tion, three dif­fer­ent mea­sure­ments were per­formed. In two of them speaker and micro­phone served as dum­mies for human com­mu­ni­ca­tion — the loud­speaker tak­ing the place of a per­son talk­ing and the micro­phone mim­ic­k­ing the ears of a lis­tener. The dis­tances between the two were cho­sen accord­ing to sit­u­a­tions in real-life communication.

A third mea­sure­ment was done to pro­vide an over­all account of the acoustics in the room and to com­pare it to the other results. It was per­formed at the longest pos­si­ble dis­tance between micro­phone and speaker.

prob­lems encountered

(1) As the venue is situated at a crossroads on the way to Rotterdam main station, cars and trams frequently pass by. This makes the room literally vibrate, which might considerably bias the results. To the student’s knowledge, however, the measurements were taken only when the street was quiet.

(2) Cooling devices had to stay switched on during the measurement. As their noise, which resembled white noise, appeared to be constant, it was disregarded as a factor that might impair the linearity of the system.

(3) Despite my tests prior to the measurement, distortion occurred in the loudspeaker during the experiment at 4.19 s of the initial sweep. This corresponds to a frequency band of 110 Hz to 220 Hz. Thus, the results for this frequency band cannot be taken into account.
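The link between a moment in the sweep and a frequency follows from the sweep formula, whose instantaneous frequency is startFreq * exp(multFact * t). A small Python sketch of that conversion is given here; the function names are illustrative, and it assumes that the 4.19 s are counted from the start of the 10 s sweep.

```python
import math

START_FREQ = 10.0    # Hz, from the sweep script
END_FREQ = 12800.0   # Hz, from the sweep script
PERIOD = 10.0        # s, from the sweep script
MULT = math.log(END_FREQ / START_FREQ) / PERIOD

def sweep_frequency(t):
    """Instantaneous frequency in Hz after t seconds of the sweep."""
    return START_FREQ * math.exp(MULT * t)

def sweep_time(freq):
    """Time in seconds at which the sweep passes the given frequency."""
    return math.log(freq / START_FREQ) / MULT

freq_at_distortion = sweep_frequency(4.19)
band_start = sweep_time(110.0)
band_end = sweep_time(220.0)
print(f"{freq_at_distortion:.0f} Hz at 4.19 s")
print(f"110 Hz to 220 Hz is swept between {band_start:.2f} s and {band_end:.2f} s")
```

Under these assumptions the sweep passes roughly 200 Hz at 4.19 s, which lies inside the band named above.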

results and analysis

pro­ce­dure

The exponential sine sweep, as recorded by the microphone, was convolved with the time-reversed initial sweep. This resulted in an impulse-like signal that contains the frequency characteristics of the room.
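How this procedure exposes reflections can be tried out on a simulated recording. The Python sketch below is not the measurement itself: the "room" is a hypothetical toy system with a direct sound and a single reflection 10 ms later at half amplitude, the sweep is shortened to 2 s, and all names are illustrative.

```python
import numpy as np

RATE = 44100  # Hz, sampling rate used in the sweep script

def exp_sweep(start_freq=10.0, end_freq=12800.0, period=2.0, rate=RATE):
    """Exponential sine sweep, same formula as the Praat script (shortened)."""
    mult = np.log(end_freq / start_freq) / period
    t = np.arange(int(period * rate)) / rate
    return np.exp(mult * 0.5 * (t - period)) * np.sin(
        2 * np.pi * start_freq / mult * np.exp(mult * t))

def fft_convolve(a, b):
    """Linear convolution computed through zero-padded FFTs."""
    n = len(a) + len(b) - 1
    size = 1 << (n - 1).bit_length()
    spectrum = np.fft.rfft(a, size) * np.fft.rfft(b, size)
    return np.fft.irfft(spectrum, size)[:n]

sweep = exp_sweep()

# Hypothetical toy room: direct sound plus one reflection after 441 samples (10 ms).
toy_room = np.zeros(2000)
toy_room[0] = 1.0
toy_room[441] = 0.5

recorded = fft_convolve(sweep, toy_room)        # what the microphone would pick up
response = fft_convolve(recorded, sweep[::-1])  # convolve with the reversed sweep

direct = int(np.argmax(np.abs(response)))
guard = 100  # samples skipped so the main lobe of the direct sound is not picked again
reflection = direct + guard + int(np.argmax(np.abs(response[direct + guard:])))
delay_ms = 1000.0 * (reflection - direct) / RATE
ratio = abs(response[reflection]) / abs(response[direct])
print(f"reflection after {delay_ms:.1f} ms at relative level {ratio:.2f}")
```

The delay and relative level of the simulated reflection reappear in the impulse-like signal, which is the property the analysis of the real recordings relies on.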

To address the goals mentioned earlier in the text, an analysis of the frequency range from 5 to 12000 Hz is sufficient to describe the system.

The record­ings and spec­tra of the impulse responses sound and look as follows.

1 The “Do it now” — scenario

This is a typical situation in which staff communicate. One of them stands on the left next to the cash box, turning his head towards his colleague in the kitchen (on the right in the picture). Often when the music is on, the person in the kitchen can hardly hear what is being said.

1.1.1 sound file of recorded sweep
1 recorded sweep

1.1.2 recorded sweep visu­al­ized

1.2 impulse response

1.2.1 sound file
1 impulse response

1.2.2 visu­al­iza­tion of the impulse response

The longest reverberation time is 40 ms at 3240 Hz, the second longest 35 ms at 3920 Hz.

1.2.3 spec­tral slice of impulse response

Significant in this picture is a slope starting at around 4400 Hz and decreasing continuously.
Peaks in the lower frequency band are at 110 Hz and 238 Hz, but they have to be disregarded in the further analysis as they are very likely a result of the speaker’s distortion. A real peak can be found at 550 Hz. Between 2040 and 2900 Hz a dip is noticeable, followed by a remarkable emphasis up to 4400 Hz. This emphasis corresponds to the longest reverberation time, which occurs at 3240 Hz.

2 The “Repeat or I’ll kneel down” — scenario

This setup simulates communication between a waiter and a guest. In real-life situations (when the music is on), waiters every so often either ask customers to repeat their orders because they could not hear them, or immediately crouch in front of the table while listening to the order being placed.

2.1.1 sound file of recorded sweep
2 recorded sweep

2.1.2 recorded sweep visu­al­ized

2.2 impulse response

2.2.1 sound file of impulse response
2 impulse response

2.2.2 visu­al­iza­tion

Again, a long rever­ber­a­tion time is notice­able between 2500 and 6900 Hz, reach­ing a peak of 0.54 sec­onds at about 3500 Hz.

2.2.3 spec­tral slice

At 84 and 282 Hz peaks occur which correspond well to the microphone’s distance to ceiling and table. As the speaker distorted, peaks at 135 Hz and 166 Hz cannot be analyzed. A dip is noticeable between 1800 and 3080 Hz with an estimated center frequency of 2600 Hz. Frequencies between 300 and 3600 Hz get transmitted very well. A slope for high frequencies starts at 5055 Hz, with the amplitude dropping continuously from 5.9 dB to -44 dB at 11900 Hz.

3 — Cen­tered posi­tion
For the centered set-up, the microphone was placed at a distance as equal as possible to all the walls.

3.1.1 sound file of recorded sweep
3 recorded sweep

3.1.2 recorded sweep visu­al­ized

3.2 impulse response

3.2.1 sound file
3 impulse response

3.2.2 visualization

The longest reverberation time is found in the band from 2800 Hz to 4150 Hz; most prominent is 3180 Hz with a length of 70 ms.

3.2.3 spectral slice

High sound pressure is found at 156 and 176 Hz, but it will be disregarded for the reasons mentioned above. True peaks are at 497, 1111, and 1146 Hz, with another one standing out at 3344 Hz. Again, a dip occurs between around 2000 and 3000 Hz. The band of 2900 to 4100 Hz is very present in the spectrum. Again a slope shows at 5100 Hz, decreasing from 20 dB to -24 dB at 11000 Hz.


conclusions

room acoustics
No significant differences can be found between the three measurements. Only the short reverberation time in the first example is remarkable.

None of the frequency responses obtained is flat, and all show a clear low-pass characteristic. Because the speaker distorted exactly in that range and the response for the other lower frequencies is rather flat, it is difficult to tell which architectural feature causes this effect most prominently. At least measurement N° 2 implies that the ceiling and tables are reflecting surfaces. One peculiarity can hardly be accounted for: it is still inexplicable to the author why the longest reverberation times of all are at 3240 Hz (wavelength about 10.6 cm), 3500 Hz (9.8 cm) and 3344 Hz (10.2 cm).*

* Remark (22 Feb­ru­ary 2012): The “inex­plic­a­ble … rever­ber­a­tion time[s]” might stem from about 60 wine glasses of four dif­fer­ent sizes which could work as a res­onator. They are hang­ing openly behind the bar, which can be partly seen on the pho­to­graph of the first measurement.
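The wavelengths quoted for the three frequencies above follow from the relation wavelength = speed of sound / frequency. A short Python sketch of that calculation is given here; the speed of sound of 343 m/s (air at about 20 °C) is an assumption, since the temperature in the bar was not recorded.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees Celsius

def wavelength_cm(freq_hz, c=SPEED_OF_SOUND):
    """Wavelength in centimetres of a sound wave of the given frequency."""
    return 100.0 * c / freq_hz

for freq in (3240, 3500, 3344):
    print(f"{freq} Hz -> {wavelength_cm(freq):.1f} cm")
```

All three wavelengths come out at around 10 cm; the exact decimals depend on rounding and on the assumed speed of sound.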

Apart from emphasizing low frequencies, in all three measurements the given features of the room suppress 2000 to 3000 Hz and favor a band between approximately 3000 and 4000 Hz.

With total reverberation times of 41 ms (N° 1), 41 ms (N° 2) and 63 ms (N° 3), the bar appears to have a rather long reverberation time, with the first two values being just above Intimacy level. The third value implies difficulties in having a conversation.

This is good evidence in support of the first part of the hypothesis, but the measurement should be repeated with a better speaker.

impli­ca­tions for speech

Even though a filtering effect was shown for low frequencies, they are not emphasized as much as expected. This might be due to the flawed hardware that was used to record. On the other hand, staff in the kitchen possibly experience low frequencies as louder than guests do, because kitchen staff work directly underneath one of the bar’s loudspeakers emitting music.

A proper measurement of the speech transmission index (STI) would have led to more reliable results. The signal-to-noise ratio would have been an important measure to draw valid conclusions. Still, the transfer of the band 200 to 3600 Hz is not equal for all frequencies. Generally speaking, the results suggest that fundamental frequencies of speech are relatively well represented by the room, whereas frequency content around 2000 to 3000 Hz gets suppressed. The latter leads to damping in the upper formant structure of certain vowels, as shown below.

source: http://www.phonetik.uni-muenchen.de/studium/skripten/SGL/V_FTab.jpg on Feb­ru­ary 21, 2012. Empha­sis mine.

Additionally, when considering the transfer for high frequencies, one must state that the steep slopes which appear at ~ 5000 Hz are certain to influence speech transmission negatively. Even though the common frequency band for telephony which was considered here works fine for its purpose, it is not flawless. In telephony, especially plosives with a lot of high-frequency content are not represented well and cannot always be distinguished, which is probably the case for my system as well.

Hence, the impression that speech becomes difficult to understand can only be true for certain frequency content, as the room suppresses only the band between 2000 Hz and 3000 Hz and the range above 5000 Hz.

Frequencies of approximately 3000 to 4000 Hz get significantly emphasized in all three cases. Nevertheless, it cannot be guaranteed that this is an actual characteristic of the room, as it might also have been caused by the speaker’s membrane when it distorted.

In room acoustics, the measure of Deutlichkeit (Brüderlin, 1995), usually referred to as Definition, describes the share of signal content arriving during the first 50 ms after excitation of the system. Any reflection occurring during that time is perceived as part of the first wavefront and therefore as reinforcing the signal rather than as an echo. The results for my system seem satisfactory: for the first two measurements a Definition of 100 % is given, for the third one it is estimated to be more than 95 %.
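Definition is commonly computed as the energy of the impulse response within the first 50 ms divided by its total energy. The Python sketch below illustrates that ratio on two synthetic, exponentially decaying responses. It does not use the measured data; the function name, the decay constants and the choice of the strongest sample as the arrival of the direct sound are assumptions.

```python
import numpy as np

def definition_d50(impulse_response, rate, window_s=0.050):
    """Share of the energy that arrives within window_s after the direct sound."""
    h = np.asarray(impulse_response, dtype=float)
    # Assumption: the direct sound is the strongest sample of the response.
    onset = int(np.argmax(np.abs(h)))
    energy = h[onset:] ** 2
    total = energy.sum()
    if total == 0.0:
        raise ValueError("impulse response contains no energy")
    early = energy[: int(round(window_s * rate))].sum()
    return early / total

rate = 44100
t = np.arange(rate) / rate  # one second of signal

# Hypothetical toy responses: one decays quickly, the other slowly.
dry_room = np.exp(-t / 0.005)
wet_room = np.exp(-t / 0.2)

d_dry = definition_d50(dry_room, rate)
d_wet = definition_d50(wet_room, rate)
print(f"dry: {100 * d_dry:.1f} %, wet: {100 * d_wet:.1f} %")
```

A response that has died away within 50 ms gives a Definition close to 100 %, while the slowly decaying one stays well below half.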

Finally, the transmission of speech is less dramatically impaired than expected, and therefore the second part of the hypothesis has to be at least partly rejected. Only further research could bring clarity.

Ref­er­ences

- Brüderlin, R.: Akustik für Musiker. Kassel: Gustav Bosse Verlag. 3rd ed. 1995.
- Meyer, J.: Akustik und musikalische Aufführungspraxis. 5th ed. 2004.
- Tempelaars, S.: Signals and Systems. Garland Pub. 1996.

http://www.sengpielaudio.com/calculator-wavelength.htm last accessed on 21 February 2012.

http://kc.koncon.nl/staff/pabon/IRM/IRMeasurementInstruction/assignment_IR_ExpSweep_MainPage.htm and all linked sites last accessed on 21 February 2012.

http://www.phonetik.uni-muenchen.de/studium/skripten/SGL/V_FTab.jpg last accessed on 21 February 2012.

http://maps.google.com/maps?q=Van+Oldenbarneveltstraat+139&hl=de&ll=51.918915,4.472637&spn=0.000931,0.001725&sll=37.0625,-95.677068&sspn=39.099308,56.513672&hq=Van+Oldenbarneveltstraat+139&radius=15000&t=h&z=19 last accessed on 21 February 2012.