The TikTok data team intends to develop a machine learning model capable of classifying user submissions as claims or opinions. To start, the team needs to organize the dataset and prepare it for exploratory data analysis.
A preliminary investigation of the dataset revealed some key relationships between variables in the data. Claims and opinions were both investigated to better understand each respective type of content.
Two variables proved to be of considerable importance: video_duration and video_view_count. Both should be considered for inclusion in a prospective machine learning model.
A careful review of the data showed the variable claim_status was useful for this project. We were able to gather the respective counts of the different claim statuses as outlined below:
data['claim_status'].value_counts()
claim 9608
opinion 9476
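The count above can be reproduced in outline as follows. This is a minimal sketch rather than the team's actual notebook: the DataFrame named toy and its five rows are hypothetical stand-ins for the real dataset, and only the claim_status column name comes from the analysis. The normalize=True call is an optional extra that reports each status as a share of all rows, which makes class balance easier to read.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (assumption).
toy = pd.DataFrame(
    {"claim_status": ["claim", "opinion", "claim", "opinion", "claim"]}
)

# Absolute number of rows per claim status.
counts = toy["claim_status"].value_counts()

# The same counts expressed as a fraction of all rows.
shares = toy["claim_status"].value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, near-equal shares for the two statuses would indicate a roughly balanced set of classes.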
Viewer engagement on the platform was correlated with claim status. The mean and median view counts for each category illustrate the relationship: videos labeled as claims received far more views than videos labeled as opinions.
Claims:
Mean view count claims: 501029.4527477102
Median view count claims: 501555.0
Opinions:
Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0
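One way to compute per-category means and medians like those above is a single groupby. This is a sketch under stated assumptions: the DataFrame named toy and its values are hypothetical and chosen only so the arithmetic is easy to check, while the column names claim_status and video_view_count are taken from the analysis.

```python
import pandas as pd

# Hypothetical toy data; the view counts are invented for illustration only.
toy = pd.DataFrame(
    {
        "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
        "video_view_count": [400000, 500000, 900000, 3000, 5000, 10000],
    }
)

# Mean and median view count for each claim status, in one table.
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

print(view_stats)
```

Reporting both statistics side by side is useful because a mean far above the median would point to a few heavily viewed videos skewing the average.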