
Pandas, concatenate certain columns if other columns are empty

I've got a CSV file that is supposed to look like this:

```
ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3, 'To small for large groups'
```

However, the CSV is somewhat malformed and currently looks like this.

```
ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
```

Now, I am able to identify the faulty rows by their lack of 'ID' and 'years_active' values, and I would like to append the 'issues' value of such a row to the last preceding row that does have 'ID' and 'years_active' values.

I am not very experienced with pandas, but came up with the following code:

```python
for index, row in df.iterrows():
    if row['years_active'] == None:
        df.loc[index-1]['issues'] += row['issues']
```

Yet - the IF condition fails to trigger. Is the thing I am trying to do possible? And if so, does anyone have an idea what I am doing wrong?

```python
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
```

You can use:

```python
new_df = df.groupby(df.ID.replace('', method='ffill')).agg(
    {'years_active': 'first', 'issues': ' '.join}
)
```

Which'll give you:

```
        years_active                      issues
ID
Car 5              3   To small for large groups
Truck1             8  In dire need of a paintjob
```

So what we're doing here is forward-filling the non-blank IDs into the subsequent blank IDs and using those as the key to group the related rows. We then aggregate, taking the first occurrence of the years_active value and joining together the issues values in the order they appear, to create a single result per ID.
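As an aside, passing `method='ffill'` to `Series.replace` is deprecated in recent pandas releases; a sketch of the same idea using `mask` plus `ffill` (with the example DataFrame from above) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})

# Turn blank IDs into NaN, then forward-fill so each continuation row
# picks up the ID of the last complete row above it.
key = df['ID'].mask(df['ID'] == '').ffill()
# key is now: Truck1, Truck1, Truck1, Car 5, Car 5

new_df = df.groupby(key).agg({'years_active': 'first', 'issues': ' '.join})
print(new_df)
```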

The following uses a for loop to find and concatenate the strings (DataFrame taken from JonClements' answer):

```python
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})

ss = ""
ii = 0
ilist = [0]
for i in range(len(df.index)):
    if i > 0 and df.ID[i] != "":
        df.issues[ii] = ss
        ss = df.issues[i]
        ii = i
        ilist.append(ii)
    else:
        ss += ' ' + df.issues[i]
df.issues[ii] = ss
df = df.iloc[ilist]
print(df)
```

Output:

```
       ID                      issues  years_active
0  Truck1  In dire need of a paintjob             8
3   Car 5   To small for large groups             3
```
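Note that assignments like `df.issues[ii] = ss` use chained indexing, which triggers SettingWithCopyWarning in modern pandas and no longer works under copy-on-write semantics. A sketch of the same row-merging loop written with `.loc` instead (same example DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})

keep = []  # positional labels of the complete rows
for i in range(len(df)):
    if df.loc[i, 'ID'] != '':
        keep.append(i)
        last = i  # remember the last complete row
    else:
        # continuation row: append its text to the last complete row
        df.loc[last, 'issues'] += ' ' + df.loc[i, 'issues']

df = df.loc[keep]
print(df)
```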

It might be worth mentioning in the context of this question that there is an often overlooked way of processing awkward input: the StringIO class from the io module.

The essential point is that read_csv can read from a StringIO 'file'.

In this case, I arrange to discard the single quotes and extra commas that would confuse read_csv, and I append the second and subsequent lines of each record to its first line, to form complete, conventional CSV lines for read_csv.

Here is the resulting DataFrame:

```
       ID  years_active                      issues
0  Truck1             8  In dire need of a paintjob
1   Car 5             3   To small for large groups
```

The code is ugly but easy to follow.

```python
import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    print(jasper.readline(), file=for_pd)  # copy the header line
    line = jasper.readline()               # skip the dashed separator
    complete_record = ''
    for line in jasper:
        line = line.rstrip().replace(', ', ',').replace("'", '')
        if line.startswith(','):
            # continuation row: strip its empty fields, append its text
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            if complete_record:
                print(complete_record, file=for_pd)
            complete_record = line
    if complete_record:
        print(complete_record, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd)
print(df)
```
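For reference, the same cleanup can be exercised without a file on disk. Here the malformed sample from the question is an inline string standing in for jasper.txt (the header is also normalised, an assumption on my part, so the column names come out without stray spaces):

```python
import pandas as pd
from io import StringIO

# The malformed sample from the question, standing in for jasper.txt.
raw = """ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
"""

lines = raw.splitlines()
for_pd = StringIO()
print(lines[0].replace(', ', ','), file=for_pd)  # header, commas normalised
complete_record = ''
for line in lines[2:]:                           # skip the dashed separator
    line = line.rstrip().replace(', ', ',').replace("'", '')
    if line.startswith(','):                     # continuation row
        complete_record += line.replace(',,', ',').replace(',', ' ')
    else:
        if complete_record:
            print(complete_record, file=for_pd)
        complete_record = line
if complete_record:
    print(complete_record, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd)
print(df)
```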