Hi Lauralicja! In general, pitch refers to the perceived fundamental frequency of a sound and is measured in hertz (Hz), while timbre refers to the tone quality or texture of a sound, which is much harder to quantify in absolute terms. Spotify's audio analysis encodes both of these per segment, but not as raw frequencies, so here's how it works.
In the segments section of the audio analysis, each segment has a single pitches array of 12 values, one for each pitch class (C, C#, D, and so on up to B). It's a chroma vector: rather than a series of pitches over time, each value describes how strongly that pitch class is present across the whole segment.
The timbre array has the same shape: each segment carries one 12-dimensional timbre vector. Its values are coefficients over a set of basis functions that together summarize the spectral character of the segment (the first dimension roughly tracks the segment's average loudness, the second its brightness, the third its spectral flatness, and so on).
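If it helps to see that concretely, here's a minimal Python sketch that pulls the audio analysis for a track and prints the first segment's pitches and timbre arrays. It assumes you already have an OAuth access token and a track ID on hand; the placeholder values below are just stand-ins.

```python
import requests

# Placeholders - substitute your own OAuth token and track ID.
ACCESS_TOKEN = "YOUR_OAUTH_TOKEN"
TRACK_ID = "YOUR_TRACK_ID"

# Fetch the full audio analysis for the track.
resp = requests.get(
    f"https://api.spotify.com/v1/audio-analysis/{TRACK_ID}",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()
analysis = resp.json()

# Each segment carries one 12-value pitches array and one 12-value timbre array.
first = analysis["segments"][0]
print("start (s):", first["start"])
print("pitches:", first["pitches"])  # 12 floats between 0.0 and 1.0, one per pitch class
print("timbre: ", first["timbre"])   # 12 floats, unbounded coefficients
```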
The values in the pitches array are decimals between 0 and 1, but they are not frequencies: a value near 1 means that pitch class clearly dominates the segment, and a value near 0 means it is barely present. In other words, the array tells you how much C, C#, D, etc. is in the segment, not which octave it sits in or how many Hz it is. The semitone math only comes in if you want to turn notes into actual frequencies: in equal temperament each octave is divided into 12 semitones, and each semitone step multiplies the frequency by 2^(1/12), or approximately 1.0595, which is how you get from A4 at 440 Hz to Bb4 at about 466.16 Hz.
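Here's a small illustrative sketch of both ideas: picking the dominant pitch class out of a pitches array (assuming the usual C-through-B ordering of the bins), and using the 2^(1/12) ratio to get from A4 to nearby notes. The example_pitches array is made up for illustration.

```python
# Pitch-class order used by the chroma vector (C first, B last).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def dominant_pitch_class(pitches):
    """Return the note name of the strongest bin in a 12-value pitches array."""
    strongest = max(range(12), key=lambda i: pitches[i])
    return NOTE_NAMES[strongest]

def equal_temperament_hz(semitones_from_a4, a4_hz=440.0):
    """Frequency of the note n semitones above (or below) A4, using the 2^(1/12) ratio."""
    return a4_hz * 2 ** (semitones_from_a4 / 12)

# Made-up pitches array where the A bin dominates.
example_pitches = [0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.0, 0.3, 0.1, 1.0, 0.2, 0.1]
print(dominant_pitch_class(example_pitches))  # -> A
print(equal_temperament_hz(0))                # -> 440.0 (A4)
print(equal_temperament_hz(1))                # -> ~466.16 (Bb4)
print(equal_temperament_hz(-12))              # -> 220.0 (A3)
```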
The timbre feature vectors are more difficult to interpret in absolute terms, as they represent complex combinations of spectral and temporal features that contribute to the perceived tone quality of the sound. However, these feature vectors can be used to train machine learning models that can classify sounds based on their timbral characteristics.
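As a rough sketch of that last point (purely illustrative, assuming you've already collected analysis objects for a set of tracks and have a label for each), you could collapse the per-segment timbre vectors into one feature vector per track and hand those to any off-the-shelf classifier, e.g. scikit-learn's RandomForestClassifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def track_timbre_features(analysis):
    """Collapse a track's per-segment 12-D timbre vectors into one feature vector."""
    timbres = np.array([seg["timbre"] for seg in analysis["segments"]])
    # Mean and standard deviation of each timbre dimension -> 24 features per track.
    return np.concatenate([timbres.mean(axis=0), timbres.std(axis=0)])

# Hypothetical inputs: a list of audio-analysis dicts plus a genre/label per track.
# analyses = [...]
# labels = [...]
# X = np.array([track_timbre_features(a) for a in analyses])
# clf = RandomForestClassifier(random_state=0).fit(X, labels)
# print(clf.predict(X[:1]))
```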
Sorry... kind of threw a lot at you lol. I hope this helps! Let me know if you have any further questions.
Highest regards,
-Prague the Dog