How to efficiently unnest (explode) multiple list columns in a pandas DataFrame

batch 2023. 4. 1. 08:36

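I am reading multiple JSON objects into a single DataFrame. The problem is that some of the columns contain lists. The data is also very large, so I cannot use the solutions I found online: they are very slow and memory-inefficient.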

Here is what the data looks like:

import pandas as pd

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

The shape of the data is (441079, 12).

The desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

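Edit: After this was marked as a duplicate, I would like to stress that in this question I was looking for an efficient way to explode multiple columns. The accepted answer can therefore efficiently explode an arbitrary number of columns on very large datasets, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).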

pandas >= 0.25

Assuming the lists in every column have the same number of elements in each row, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to first set every column that must NOT be exploded as the index, and then reset the index afterwards.
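The same idea can be wrapped in a small helper that takes the list columns instead of the index columns. This is only a sketch: the name `explode_columns` and the demo frame are made up for illustration, and it assumes at least one column is left unexploded and that the list columns have matching lengths within each row.

```python
import pandas as pd


def explode_columns(df, list_cols):
    """Explode `list_cols` together; hypothetical helper, not a pandas API."""
    # Everything that is not a list column becomes the index, so it is
    # repeated automatically when the list columns are exploded.
    keep_cols = [c for c in df.columns if c not in list_cols]
    exploded = df.set_index(keep_cols).apply(pd.Series.explode).reset_index()
    # Restore the original column order.
    return exploded[df.columns.tolist()]


df_demo = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3', 'v4']],
    'C': [['c1', 'c2'], ['c3', 'c4']],
})
out = explode_columns(df_demo, ['B', 'C'])
print(out)
```

Passing the list columns rather than the index columns tends to be more convenient when the frame has many scalar columns and only a few list columns.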

It is also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop(columns='level_1'))  # positional axis is rejected by pandas 2.0+


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Use set_index on A and, on the remaining columns, apply and stack the values. All of this is condensed into a single one-liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop(columns='level_1'))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

import numpy as np
import pandas as pd


def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists (taken from the first list column)
    lens = df[lst_cols[0]].str.len()

    # repeat the scalar columns and flatten the list columns
    res = pd.DataFrame({
        col: np.repeat(df[col].values, lens)
        for col in idx_cols
    }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return res.loc[:, df.columns]

    # at least one list in cells is empty: add those rows back
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    return pd.concat([res, df.loc[lens == 0, idx_cols]]) \
             .fillna(fill_value) \
             .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
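The core of the function above is a pair of NumPy calls: np.repeat stretches each scalar column by the per-row list length, and np.concatenate flattens the list column. The following minimal sketch shows just that step on a made-up two-row frame (`df_demo` and the variable names are illustrative only); the lists may differ in length between rows.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2', 'v3'], ['v4']],
})

# Number of elements in each list cell: one repeat count per row.
lens = df_demo['B'].str.len().to_numpy()

# Repeat every scalar value as many times as its row's list is long.
a_rep = np.repeat(df_demo['A'].to_numpy(), lens)

# Flatten the lists into one long 1-D array.
b_flat = np.concatenate(df_demo['B'].tolist())

flat = pd.DataFrame({'A': a_rep, 'B': b_flat})
print(flat)
```

Because both steps operate on whole arrays rather than row by row, this approach generally scales better than apply-based solutions on large frames.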

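Building on @cs95's answer, we can use an if clause in the lambda function instead of setting all the other columns as the index. This has the following advantages:

Preserves column order
Lets you easily specify columns using the set you want to modify, x.name in [...], or the set you do not want to modify, x.name not in [...].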

df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8

As of pandas 1.3.0 (What's new in 1.3.0 (July 2, 2021)):

DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240).

So the operation is now as simple as:

df.explode(['B', 'C', 'D', 'E'])

    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

Or, if you want unique indexing:

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8
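One caveat with multi-column DataFrame.explode: the lists in each row must have matching lengths across the exploded columns, otherwise pandas raises a ValueError. The sketch below (the `lengths_match` helper and the demo frame are hypothetical, not part of pandas) checks this up front on a frame with a deliberate mismatch. It assumes pandas 1.3.0 or later.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': [['v1', 'v2'], ['v3']],          # second row has only one element
    'C': [['c1', 'c2'], ['c3', 'c4']],
})


def lengths_match(df, cols):
    """Return True if, in every row, all `cols` hold lists of equal length."""
    lens = df[cols].apply(lambda s: s.str.len())
    # Compare every column's lengths against the first column, row by row.
    return bool(lens.eq(lens.iloc[:, 0], axis=0).all().all())


ok = lengths_match(df_demo, ['B', 'C'])

try:
    df_demo.explode(['B', 'C'])
    raised = False
except ValueError:
    # Mismatched element counts between B and C in the second row.
    raised = True

print(ok, raised)
```

When the check fails, the rows need to be padded or trimmed to a common length before exploding.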

Gathering the answers from this and other threads, here is how I do it for delimiter-separated string columns:

from collections.abc import Sequence
import pandas as pd
import numpy as np


def explode_by_delimiter(
    df: pd.DataFrame,
    columns: str | Sequence[str],
    delimiter: str = ",",
    reindex: bool = True
) -> pd.DataFrame:
    """Convert dataframe with columns separated by a delimiter into an
    ordinary dataframe. Requires pandas 1.3.0+."""
    if isinstance(columns, str):
        columns = [columns]

    col_dict = {
        col: df[col]
        .str.split(delimiter)
        # Without .fillna(), .explode() will fail on empty values
        .fillna({i: [np.nan] for i in df.index})
        for col in columns
    }
    df = df.assign(**col_dict).explode(columns)
    return df.reset_index(drop=True) if reindex else df
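For reference, the split-then-explode pattern used by the function above can be tried on a small frame as follows. This is a self-contained sketch with made-up data that inlines the two steps rather than calling the function. It assumes pandas 1.3.0 or later and no missing values in the string columns.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': ['x1', 'x2'],
    'B': ['v1,v2', 'v3,v4'],
    'C': ['c1,c2', 'c3,c4'],
})
cols = ['B', 'C']

# Step 1: turn each delimited string into a list.
split = df_demo.assign(**{c: df_demo[c].str.split(',') for c in cols})

# Step 2: explode all list columns at once and renumber the rows.
out = split.explode(cols).reset_index(drop=True)
print(out)
```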

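Here is my solution using the 'apply' function. Main features / differences:

Offers the option to explode selected multiple columns or all columns
Offers the option to specify the values used to fill the 'missing' positions (via the parameter fill_mode = 'external', 'internal', or 'trim'). A full explanation would be long, so see the examples below, change the options, and check the results.

Note: the 'trim' option was developed for my own needs and is outside the scope of this question.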

    def lenx(x):
        return len(x) if isinstance(x,(list, tuple, np.ndarray, pd.Series)) else 1

    def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
        jcols = [j for j,v in enumerate(row.index) if v in cols]
        if len(jcols)<1:
            jcols = range(len(row.index))
        Ls = [lenx(x) for x in row.values]
        if not Ls[:-1]==Ls[1:]:
            vals = [v if isinstance(v,list) else [v] for v in row.values]
            if fill_mode=='external':
                vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                        else e + [fill_value]*(max(Ls)-lenx(e))
                        for j,e in enumerate(vals)]
            elif fill_mode == 'internal':
                vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                        else e+[e[-1]]*(max(Ls)-lenx(e)) 
                        for j,e in enumerate(vals)]
            else:
                vals = [e[0:min(Ls)] for e in vals]
            row = pd.Series(vals,index=row.index.tolist())
        return row

Examples:

    df=pd.DataFrame({
        'a':[[1],2,3],
        'b':[[4,5,7],[5,4],4],
        'c':[[4,5],5,[6]]
    })
    print(df)
    df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
    df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
    df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
    df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
    print('\nfill_mode=\'trim\', all columns\n', df4)

Output:

         a          b       c
    0  [1]  [4, 5, 7]  [4, 5]
    1    2     [5, 4]       5
    2    3          4     [6]
    
    fill_mode='external', all columns, fill_value = 'OK'
         a  b   c
    0   1  4   4
    0  OK  5   5
    0  OK  7  OK
    1   2  5   5
    1  OK  4  OK
    2   3  4   6
    
    fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
         a  b       c
    0   1  4  [4, 5]
    0  OK  5      OK
    0  OK  7      OK
    1   2  5       5
    1  OK  4      OK
    2   3  4       6
    
    fill_mode='internal', cols = ['a', 'b']
        a  b       c
    0  1  4  [4, 5]
    0  1  5  [4, 5]
    0  1  7  [4, 5]
    1  2  5       5
    1  2  4       5
    2  3  4       6
    
    fill_mode='trim', all columns
        a  b  c
    0  1  4  4
    1  2  5  5
    2  3  4  6

Source URL: https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe
