首先,我知道有关此问题的答案,但是到目前为止,它们都没有为我工作。无论如何,尽管我已经使用了该解决方案,但我想知道您的答案。
我有一个名为mbti_datasets.csv
的csv文件。第一列的标签为type
,第二列的标签为description
。每行代表一种新的人格类型(及其各自的类型和描述)。
TYPE | DESCRIPTION
a | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
b | b.description
c | c.description
d | d.description
...16 types | ...
在下面的代码中,当描述具有\n
时,我试图复制每个个性类型。
代码:
import pandas as pd
# Reading the file
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/mbti_datasets.csv'
root_fn = path_rooth + 'mbti_datasets.csv'
df = pd.read_csv(path_root, sep = ',', quotechar = '"', usecols = [0, 1])
# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series, 1).stack()
# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)
# give it a name to join it to the DataFrame
serie.name = 'description'
# remove original column
del df['description']
# join the series with the DataFrame, based on the shared index
df = df.join(serie)
# New file name and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)
print(new_df)
预期输出:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...
a | In fact, are strong people...
b | b.description
b | b.description
c | c.description
... | ...
当前输出:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...NaN
a | NaN
a | In fact, are strong people...NaN
b | b.description...NaN
b | NaN
b | b.description
c | c.description
... | ...
我不确定100%,但是我认为NaN值为\r
。
根据要求上传到github的文件: CSV FILES
使用@YOLO解决方案: CSV YOLO FILE 例如。哪里失败了:
2 INTJ Existe soledad en la cima y-- siendo # adds -- in blank random blank spaces
3 INTJ -- y las mujeres # adds -- in the beginning
3 INTJ (...) el 0--8-- de la poblaci # doesnt end the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts letters randomly
12 INTJ c #adds just 1 letter
为了全面理解而进行的翻译:
2 INTJ There is loneliness at the top and-- being # adds -- in blank spaces
3 INTJ -- and women # adds - in the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesnt end the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts letters randomly
12 INTJ c #adds just 1 letter
当我显示是否存在任何NaN值以及哪种类型时:
print(new_df['descripcion'].isnull())
<class 'float'>
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
11 True
continue...
答案 0 :(得分:2)
这是一种解决方法,我必须找到一种替代方法来替换\n
字符,以某种方式无法直接使用它:
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace('[^a-zA-Z0-9\s.]','--').str.split('--n')
df = df.explode('DESCRIPTION')
print(df)
TYPE DESCRIPTION
0 a This personality likes to eat apples...
0 a They look like monkeys...
0 a In fact-- are strong people...
1 b b.description
2 c c.description
3 d d.description
答案 1 :(得分:0)
该问题可以归因于描述单元格,因为有些部分有两条新的连续行,而它们之间没有任何内容。
我只是使用;with t1 as ( -- Exclude customer comeback in the same date
select distinct userid, date
from #table1
),
t2 as (-- Get week in year
select userid, 'Week ' + cast(datepart(wk, date) as varchar(2)) Week
from t1
)
select userid, Week, count(*) as numberOfVisit -- group by userId and week in year
from t2
group by userid, Week
having count(*) > 1
来读取创建的新csv,并在不使用NaN值的情况下重写它。无论如何,我认为重复此过程并不是最好的方法,但它可以直接作为解决方案。
;with t1 as (
select distinct userid, date
from #table1
),
t2 as (
select userid, 'Week ' + cast(datepart(wk, date) as varchar(2)) Week
from t1
),
t3 as (
select userid, Week, count(*) as numberOfVisit
from t2
group by userid, Week
having count(*) > 1)
select count(*) Total
from t3