$\begingroup$

I am working with a CSV file of movie details from IMDB. The dataframe has a genres column containing all the genres of each movie, separated by a pipe (|).
I need to extract the first two genres from the genres column and store them in two new columns: genre_1 and genre_2.
For the rows where there is only one genre, the single genre should go into both columns, i.e. for such movies genre_2 will be the same as genre_1.

I am sharing screenshots of the code and the results I have so far. This is my data frame, movies, with the genres column.
This is what I have been trying.

Now, I can create a new data frame from the split genres, remove the unwanted columns, and concatenate the rest with the original data frame. But that looks pretty clumsy.
How can I split the column within my original data frame and drop the unwanted expanded columns?
Any help is appreciated.

$\endgroup$

2 Answers

$\begingroup$

This is a programming question rather than a data science question.

You need to use apply with a lambda function. In apply you must pass axis=1, which means the function is applied to each row rather than to each column. So if your DataFrame is called movies:

    def get_genre(row, genre_index):
        # Split the pipe-separated string into a list of genres
        array_genres = row['genres'].split('|')
        if len(array_genres) == 0:
            return ''
        elif len(array_genres) == 1 and genre_index == 1:
            # Only one genre: reuse it for genre_2
            return array_genres[0]
        else:
            return array_genres[genre_index]

    movies['genre_1'] = movies.apply(lambda row: get_genre(row, 0), axis=1)
    movies['genre_2'] = movies.apply(lambda row: get_genre(row, 1), axis=1)
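As an alternative to the row-wise apply above, here is a minimal sketch that uses the pandas string accessor instead. The sample rows are made up for illustration and only stand in for the real movies frame; the sketch assumes every row in genres is a non-empty string.

```python
import pandas as pd

# Hypothetical sample data standing in for the real `movies` frame
movies = pd.DataFrame(
    {"genres": ["Fantasy|Sci-Fi", "Action|Adventure|Fantasy", "Thriller"]}
)

# Split once into a list of genres per row
parts = movies["genres"].str.split("|")

# .str[i] picks the i-th list element and yields NaN when it does not exist
movies["genre_1"] = parts.str[0]
# Fall back to genre_1 for movies that have only one genre
movies["genre_2"] = parts.str[1].fillna(movies["genre_1"])

print(movies)
```

This writes the two new columns directly into the original frame, so there are no intermediate expanded columns to drop afterwards.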
$\endgroup$
$\begingroup$

Try:

    import pandas as pd

    # Create an example dataframe
    df = pd.DataFrame({"genres": ["Fantasy|Sci-Fi", "Action|Adventure|Fantasy", "Thriller",
                                  "Action|Adventure|Thriller|bbv", "Action",
                                  "Action|Adventure|thriller"]})
    # Get a dataframe with as many columns as there are genres
    df = df.genres.str.get_dummies(sep="|")
    # Get the genres as values
    df = df.multiply(df.columns)
    # Rename the columns to have the genre id
    df.columns = ["genre_" + str(x) for x in range(len(df.columns))]
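To see what the get_dummies step produces, here is a small sketch on made-up rows. It shows that str.get_dummies builds one indicator column per distinct genre, ordered by genre name, so the resulting columns identify genres rather than a movie's first or second genre.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({"genres": ["Fantasy|Sci-Fi", "Action|Fantasy", "Thriller"]})

# One indicator column per distinct genre found in the data
dummies = sample["genres"].str.get_dummies(sep="|")

print(list(dummies.columns))
print(dummies)
```

Because the column position depends on the genre name and not on where the genre appears in the string, this layout suits one-hot encoding better than picking the first two genres.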


EDIT:

You can simply use the pandas assign method:

    df.assign(genre1=df.genres.str.split("|", expand=True).iloc[:, :1],
              genre2=df.genres.str.split("|", expand=True).iloc[:, 1:2])
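The split with expand=True leaves the second column empty for movies with a single genre. A minimal sketch of one way to cover that case and attach only the two wanted columns to the original frame follows; the sample rows and the helper name first_two are assumptions for illustration, and the column names genre_1 and genre_2 follow the question.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame(
    {"genres": ["Fantasy|Sci-Fi", "Action|Adventure|Fantasy", "Thriller"]}
)

# Expand once; reindex keeps exactly columns 0 and 1 even if fewer exist
first_two = df["genres"].str.split("|", expand=True).reindex(columns=[0, 1])
first_two.columns = ["genre_1", "genre_2"]

# Single-genre movies have no second value: copy genre_1 into genre_2
first_two["genre_2"] = first_two["genre_2"].fillna(first_two["genre_1"])

# Attach only the two new columns to the original frame
df = df.join(first_two)

print(df)
```

Splitting once and joining avoids both the repeated split call and any leftover expanded columns.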


$\endgroup$
