Reading only the words of a specific speaker and adding those words to a list

I have a transcript, and to analyze each speaker I need to add only that speaker's words to a string. The problem I'm having is that not every line starts with the speaker's name. Here's a snippet of my text file:

BOB: blah blah blah blah blah hello goodbye etc.
JERRY:............................................. ...............
BOB:blah blah blah blah blah blah blah.

I want to collect only the words the chosen speaker (in this case Bob) said and add them to a string, excluding the words from Jerry and any other speakers. Any ideas for this?

Edit: There are line breaks between paragraphs and before any new speaker starts.

Answer1:

Keep track of a current_speaker variable. Every time a line starts with a speaker's name, update it, and decide what to do with the following lines according to that speaker. Keep reading lines until the speaker changes.
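A minimal sketch of that idea, without a regex, might look like the following. The function name `words_by_speaker` and the rule that a speaker label is a single all-uppercase word before the first colon are assumptions made for illustration; the question does not specify them.

```python
def words_by_speaker(lines):
    """Group words by speaker, carrying the current speaker across lines."""
    result = {}
    current_speaker = None
    for line in lines:
        line = line.strip()
        label, sep, rest = line.partition(":")
        # Assumption: a speaker label is one all-uppercase word, e.g. "BOB".
        if sep and label.isalpha() and label.isupper():
            current_speaker = label
            line = rest
        # Lines without a label belong to whoever spoke last.
        if current_speaker is not None:
            result.setdefault(current_speaker, []).extend(line.split())
    return result


sample = [
    "BOB: blah blah hello",
    "goodbye etc.",
    "",
    "JERRY: ... ...",
    "",
    "BOB:blah blah.",
]
print(words_by_speaker(sample)["BOB"])
```

Lines that carry no label (such as "goodbye etc." in the sample) are credited to the most recent speaker, which is the point of keeping current_speaker around.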

Answer2:

Using a regex is the best way to go. Since you'll be applying it to every line, you can save a bit of processing by compiling it once up front.

    import re

    speaker_words = {}
    # Compile once; matches lines that start with a label such as "BOB:".
    speaker_pattern = re.compile(r'^(\w+?):(.*)$')

    with open("transcript.txt", "r") as f:
        lines = f.readlines()

    current_speaker = None
    for line in lines:
        line = line.strip()
        match = speaker_pattern.match(line)
        if match is not None:
            # A new speaker starts here; the rest of the line is their text.
            current_speaker = match.group(1)
            line = match.group(2).strip()
            if current_speaker not in speaker_words:
                speaker_words[current_speaker] = []
        if current_speaker:
            # you may want to do some sort of punctuation filtering too
            words = line.split()  # splits on whitespace and drops empty strings
            speaker_words[current_speaker].extend(words)

    print(speaker_words)

This outputs the following:

    {
        "BOB": ['blah', 'blah', 'blah', 'blah', 'blah', 'hello', 'goodbye', 'etc.', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah.'],
        "JERRY": ['.............................................', '...............']
    }
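Since the question asks for one speaker's words as a single string, the list for that speaker can be joined afterwards. The sketch below assumes a dictionary shaped like speaker_words above; the helper name `speaker_text` and the choice to strip punctuation from the edges of each word with `string.punctuation` are illustrative assumptions, not part of the original answer.

```python
import string


def speaker_text(speaker_words, speaker, strip_punctuation=False):
    """Return all words of one speaker joined into a single string."""
    words = speaker_words.get(speaker, [])
    if strip_punctuation:
        # Remove punctuation at the edges of each word; drop words left empty.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    return " ".join(words)


# Hypothetical data in the same shape as the output shown above.
speaker_words = {
    "BOB": ["blah", "hello", "goodbye", "etc.", "blah."],
    "JERRY": ["...", "..."],
}
print(speaker_text(speaker_words, "BOB"))
print(speaker_text(speaker_words, "BOB", strip_punctuation=True))
```

A speaker who does not appear in the dictionary yields an empty string rather than an error, which may or may not be what you want for your analysis.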
