
Testing data and train data are repeated in Shakespeare #19

Closed

zliangak opened this issue Nov 30, 2019 · 7 comments

Comments

@zliangak
Using the splitting method provided in the paper_experiment, I found that the testing data appears verbatim in the training data, resulting in a 100% testing accuracy if you train a 2-layer LSTM with SGD.

For example, in the training set, 'THE_FIRST_PART_OF_HENRY_THE_SIXTH_MORTIMER''s words are:
[..., g age, Let dying Mortimer here rest himself. Even like a man new haled from the ',
' age, Let dying Mortimer here rest himself. Even like a man new haled from the r',
'age, Let dying Mortimer here rest himself. Even like a man new haled from the ra',
'ge, Let dying Mortimer here rest himself. Even like a man new haled from the rac',
'e, Let dying Mortimer here rest himself. Even like a man new haled from the rack',
', Let dying Mortimer here rest himself. Even like a man new haled from the rack,',
'Let dying Mortimer here rest himself. Even like a man new haled from the rack, S',
'et dying Mortimer here rest himself. Even like a man new haled from the rack, So',
't dying Mortimer here rest himself. Even like a man new haled from the rack, So ',
' dying Mortimer here rest himself. Even like a man new haled from the rack, So f',
'dying Mortimer here rest himself. Even like a man new haled from the rack, So fa',
'ying Mortimer here rest himself. Even like a man new haled from the rack, So far',
'ing Mortimer here rest himself. Even like a man new haled from the rack, So fare',
'ng Mortimer here rest himself. Even like a man new haled from the rack, So fare ',
'g Mortimer here rest himself. Even like a man new haled from the rack, So fare m',...]

and in the testing set, you can find the exact same sentence:

[' Let dying Mortimer here rest himself. Even like a man new haled from the rack, ']

My model reaches a testing accuracy of about 1.0 within roughly 35 epochs.
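For reference, the leak follows directly from how the samples are built. A minimal sketch (my own construction with a made-up text and seq_len = 20, not LEAF's actual code, which uses seq_len = 80) showing why any per-window split of overlapping sliding windows puts near-duplicates in both sets:

```python
# Hypothetical illustration of the leak: one window starts at every
# character, so neighbouring windows share seq_len - 1 characters.
text = ("Let dying Mortimer here rest himself. "
        "Even like a man new haled from the rack,")
seq_len = 20  # stand-in for LEAF's seq_len = 80

windows = [text[i:i + seq_len] for i in range(len(text) - seq_len + 1)]

# Any per-window split (here: every 10th window held out, standing in
# for a random 90/10 split) scatters adjacent windows across both sets.
train = [w for i, w in enumerate(windows) if i % 10 != 9]
test = [w for i, w in enumerate(windows) if i % 10 == 9]

# Every held-out window matches a training window in all but one character,
# so a model can effectively memorize the test set during training.
for t in test:
    assert any(t[1:] == w[:-1] for w in train)
```

Under a split like this, high test accuracy measures memorization of the shared text rather than generalization.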

@scaldas
Collaborator

scaldas commented Dec 2, 2019

You are right. The split should be temporal rather than random, and we should have been careful to avoid this. We will work on fixing it.

Can you provide us with the parameters that you used to obtain the testing accuracy (learning rate, size of the layers, etc.)?

Thank you.

@zliangak
Author

zliangak commented Dec 2, 2019

I am using PyTorch. All the info is in the file below. One difference in my model is that I feed all the hidden states (instead of only the last one) into the linear layer. When I use your implementation of the LSTM, i.e. feeding only the last hidden state, I can still get a pretty high accuracy given enough training epochs.

Hope this can be fixed soon. It is a very helpful dataset. Thanks.

test.pdf

@zliangak
Author

zliangak commented Dec 4, 2019

Sorry, there is a mistake in my implementation, in cell 6 of "test.pdf". When defining test_loader, I should have used dataset=test_set instead of dataset=train_set.

This does not affect the existence of the problem described in this issue.

Regards,

@scaldas
Collaborator

scaldas commented Dec 5, 2019

Sorry, I'm confused by your update. Does this mean the issue remains?

@zliangak
Author

zliangak commented Dec 6, 2019

Yes, the issue remains.

@scaldas
Collaborator

scaldas commented Mar 11, 2020

I have modified the train/test splits for Shakespeare. They are now split temporally, and samples that would leak test information into the training set are discarded. This means that if the last training sample occurs at index i, the first test sample occurs at index i + seq_len. We use a seq_len of 80.

A side effect of this change is that some users now don't have any test samples, and have to be dropped from training.
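A sketch of this temporal split (my own reconstruction from the description above, with an illustrative text and seq_len = 5, not the actual LEAF code): train on a prefix of the windows, skip the next seq_len windows so no test character was ever seen in training, then test on the rest.

```python
def temporal_split(windows, seq_len, train_frac=0.8):
    """Split sliding windows temporally, dropping the boundary region.

    If the last training window starts at index i, the first test window
    starts at index i + seq_len, so the two sets share no characters of
    the underlying text. Users with too few windows can end up with an
    empty test set and would have to be dropped from training.
    """
    n_train = int(train_frac * len(windows))
    train = windows[:n_train]
    test = windows[n_train - 1 + seq_len:]
    return train, test

# Illustrative data (seq_len = 5 stands in for LEAF's 80).
text = "abcdefghijklmnopqrstuvwxyz" * 4
seq_len = 5
windows = [text[i:i + seq_len] for i in range(len(text) - seq_len + 1)]
train, test = temporal_split(windows, seq_len)

# The last text index covered by training never overlaps the first index
# covered by testing.
last_train_index = len(train) - 1 + seq_len      # exclusive end of train coverage
first_test_index = len(windows) - len(test)      # start of first test window
assert last_train_index <= first_test_index
```

This sketch discards the seq_len - 1 boundary windows outright, matching the description of samples being "ignored" rather than reassigned.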

@zliangak
Author

Cool! Thanks a lot!

mehreentahir16 pushed a commit to mehreentahir16/leaf that referenced this issue Mar 8, 2024